PaliGemma
PaliGemma 2 and PaliGemma are lightweight, open vision-language models (VLMs) inspired by PaLI-3 and built on open components such as the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as input and can answer questions about images with detail and context, meaning that it can perform deeper analysis of images and provide useful outputs, such as captions for images and short videos, object detection, and reading text embedded within images.
PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.
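For example, the checkpoints published on Hugging Face encode both the parameter size and the input resolution in the model ID. The following Python sketch loads one such variant with the Hugging Face transformers library and generates a caption. The checkpoint ID and the task-prefix prompt format follow the published model cards, so verify them against the card for the variant you download.

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# 3B parameters, 224 x 224 input resolution (ID pattern per the model
# cards; verify the exact ID on the model page before use).
model_id = "google/paligemma2-3b-pt-224"

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg")  # any local image file
prompt = "<image>caption en"     # task-prefix prompt style used by PaliGemma

inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = inputs["input_ids"].shape[-1]
output = model.generate(**inputs, max_new_tokens=32)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```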
You can view and download PaliGemma models from the following sites:
- Download from Kaggle.
- Download from Hugging Face.
There are two categories of PaliGemma models, a general-purpose category and a research-oriented category (see the checkpoint naming sketch after this list):
- PaliGemma - General-purpose pretrained models that can be fine-tuned on a variety of tasks.
- PaliGemma-FT - Research-oriented models that are fine-tuned on specific research datasets.
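On the model hubs, the category is visible in the checkpoint name itself. The IDs below follow the pattern used on the Hugging Face model pages; the research fine-tune shown (on the DOCCI captioning dataset) is one published example. Treat both as illustrative and confirm the exact names on the model pages.

```python
# Illustrative checkpoint IDs only; confirm exact names on the model pages.
GENERAL_PURPOSE = "google/paligemma2-3b-pt-448"    # "pt": pretrained base, meant for fine-tuning
RESEARCH_FT = "google/paligemma2-3b-ft-docci-448"  # "ft": fine-tuned on a specific research dataset (DOCCI)
```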
Key benefits include:
- Multimodal capability - Simultaneously handles both image and text inputs.
- Versatile base model - Can be fine-tuned on a wide range of vision-language tasks.
- Off-the-shelf exploration - Comes with a checkpoint fine-tuned on a mixture of tasks for immediate research use, as shown in the sketch after this list.
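To see that off-the-shelf behavior, the mixture-fine-tuned ("mix") checkpoint can be prompted directly with task prefixes such as detect. The sketch below assumes the published PaliGemma mix checkpoint ID and the detection prompt format described in the model cards; the model's output interleaves location tokens with class labels.

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Mixture-fine-tuned checkpoint (ID per the model pages; verify before use).
model_id = "google/paligemma-3b-mix-224"

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("street.jpg")  # any local image file
prompt = "<image>detect car"      # "detect <class>" task prefix

inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = inputs["input_ids"].shape[-1]
output = model.generate(**inputs, max_new_tokens=64)

# Expected output shape: "<loc....><loc....><loc....><loc....> car", where the
# four <loc> tokens encode a normalized bounding box (y_min, x_min, y_max, x_max).
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```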