PaliGemma is a lightweight open vision-language model (VLM) inspired by PaLI-3 and built from open components: the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context. This means it can perform deeper analysis of images and provide useful outputs, such as captioning for images and short videos, object detection, and reading text embedded within images.
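For detection tasks, PaliGemma encodes each bounding box as four `<locNNNN>` tokens (y_min, x_min, y_max, x_max, each normalized to a 0–1023 grid) followed by the object label. As a minimal sketch, the helper below (`parse_detection` is a hypothetical name, not part of any official API) converts such output into pixel coordinates, assuming this token format:

```python
import re

# Matches four consecutive <locNNNN> tokens followed by a label.
# Assumption: boxes are ordered y_min, x_min, y_max, x_max on a 0-1023 grid.
LOC_RE = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")

def parse_detection(output: str, width: int, height: int):
    """Return a list of (label, (x_min, y_min, x_max, y_max)) in pixels."""
    boxes = []
    for locs, label in LOC_RE.findall(output):
        # Extract the four normalized coordinates from the location tokens.
        y0, x0, y1, x1 = (int(v) for v in re.findall(r"<loc(\d{4})>", locs))
        boxes.append((
            label.strip(),
            (x0 * width // 1024, y0 * height // 1024,
             x1 * width // 1024, y1 * height // 1024),
        ))
    return boxes
```

For example, on a 1024×1024 image, the model output `"<loc0256><loc0128><loc0768><loc0512> cat"` would be parsed as a `cat` box spanning pixels (128, 256) to (512, 768).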

There are two sets of PaliGemma models, a general purpose set and a research-oriented set:

  • PaliGemma - General-purpose pretrained models that can be fine-tuned on a variety of tasks.
  • PaliGemma-FT - Research-oriented models that are fine-tuned on specific research datasets.

Key benefits include:

  • Simultaneously understands both images and text.
  • Can be fine-tuned on a wide range of vision-language tasks.
  • Comes with a checkpoint fine-tuned on a mixture of tasks for immediate research use.

Learn more

PaliGemma's model card contains detailed information about the model, including implementation details, evaluation results, usage and limitations, and more.
View more code, Colab notebooks, information, and discussions about PaliGemma on Kaggle.
Run a working example for fine-tuning PaliGemma with JAX in Colab.