PaliGemma

PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLMs) inspired by PaLI-3 and built on open components such as the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as input and can answer questions about images with detail and context. This means it can perform deeper analysis of images and provide useful outputs, such as captioning images and short videos, detecting objects, and reading text embedded within images.
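As a quick illustration of the image-plus-text interface, here is a minimal captioning sketch using the Hugging Face Transformers API, which provides a `PaliGemmaForConditionalGeneration` class. The checkpoint ID and image URL below are placeholders chosen for this example, not requirements.

```python
# Minimal PaliGemma captioning sketch. Assumes a recent transformers release
# with PaliGemma support, plus Pillow and requests; the model ID and image
# URL are placeholders for illustration.
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # a mixture-fine-tuned checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open(
    requests.get("https://example.com/cat.jpg", stream=True).raw  # placeholder
)
prompt = "caption en"  # task-prefix prompt understood by the mix checkpoints

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)

# The output includes the prompt tokens; decode only the newly generated part.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```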

PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.
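To make the size and resolution options concrete, the sketch below composes a Hugging Face checkpoint ID from them. The `google/paligemma2-{size}-pt-{resolution}` naming pattern matches the published pretrained PaliGemma 2 checkpoints, but the helper function itself is illustrative, not an official API.

```python
# Illustrative helper (not an official API). Pretrained PaliGemma 2 checkpoints
# on Hugging Face follow the pattern "google/paligemma2-{size}-pt-{resolution}".
def paligemma2_checkpoint(size: str = "3b", resolution: int = 224) -> str:
    if size not in {"3b", "10b", "28b"}:
        raise ValueError("PaliGemma 2 comes in 3b, 10b, and 28b parameter sizes")
    if resolution not in {224, 448, 896}:
        raise ValueError("supported input resolutions are 224, 448, and 896")
    return f"google/paligemma2-{size}-pt-{resolution}"

print(paligemma2_checkpoint("10b", 448))  # google/paligemma2-10b-pt-448
```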

You can view and download PaliGemma models from Kaggle and Hugging Face.

There are two categories of PaliGemma models, general purpose and research-oriented:

  • PaliGemma - General purpose pretrained models that can be fine-tuned on a variety of tasks.
  • PaliGemma-FT - Research-oriented models that are fine-tuned on specific research datasets.

Key benefits include:

  • Handles image and text input simultaneously.
  • Can be fine-tuned on a wide range of vision-language tasks.
  • Comes with a checkpoint fine-tuned on a mixture of tasks for immediate research use (see the detection sketch below).
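Those mixture-fine-tuned checkpoints respond to task-prefix prompts such as `detect cat`, returning bounding boxes as `<locNNNN>` tokens, where each token encodes one coordinate binned to a 1024-position grid in the order y_min, x_min, y_max, x_max. The sketch below is one reading of that output format, not official tooling; the response string and image size are invented for illustration.

```python
import re

# Parse PaliGemma detection output such as:
#   "<loc0123><loc0456><loc0789><loc1000> cat"
# Each <locNNNN> token is a coordinate binned to a 1024-position grid,
# ordered y_min, x_min, y_max, x_max; multiple detections are separated by ";".
def parse_detections(text: str, width: int, height: int):
    boxes = []
    for match in re.finditer(r"((?:<loc\d{4}>){4})\s*([^<;]+)", text):
        y0, x0, y1, x1 = (
            int(c) / 1024 for c in re.findall(r"<loc(\d{4})>", match.group(1))
        )
        boxes.append((match.group(2).strip(),
                      (x0 * width, y0 * height, x1 * width, y1 * height)))
    return boxes

# Invented example response for a 448 x 448 input image.
print(parse_detections("<loc0123><loc0456><loc0789><loc1000> cat", 448, 448))
```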

Learn more

  • Try detection and content generation capabilities with PaliGemma in Colab.
  • Fine-tune a PaliGemma model with image data using JAX in Colab.
  • View more code, Colab notebooks, information, and discussions about PaliGemma on Kaggle.