PaliGemma prompt and system instructions

This page describes prompt formatting and system instructions for PaliGemma models. These Gemma model variants use the same general formatting as Gemma foundation models, and also support a special syntax for specific image-related tasks.

Prompt format

PaliGemma models use the same prompt formatting as the Gemma foundation models they are based on. However, PaliGemma models also support a special task syntax, which is described in the next section. For more information on Gemma prompt formatting, see Gemma prompt and system instructions.

Image and text data order

When prompting PaliGemma models with text and image data, always provide the image data first, followed by the text prompt. Reversing this order, or interleaving image and text data, typically produces unusable responses.
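As a rough illustration of the ordering rule, the sketch below checks a list of input parts before they are sent to a model. The `(kind, data)` tuple representation and the `is_image_first` helper are hypothetical, introduced only for this example; they are not part of any PaliGemma API.

```python
def is_image_first(parts):
    """Return True if every image part comes before every text part.

    `parts` is a hypothetical list of (kind, data) tuples, where kind is
    "image" or "text". This is an illustration, not a PaliGemma API.
    """
    kinds = [kind for kind, _ in parts]
    seen_text = False
    for kind in kinds:
        if kind == "text":
            seen_text = True
        elif kind == "image" and seen_text:
            # An image after text means the parts are reversed or mixed.
            return False
    # The input must also start with image data.
    return bool(kinds) and kinds[0] == "image"


good = [("image", "cow.jpg"), ("text", "caption en\n")]
reversed_order = [("text", "caption en\n"), ("image", "cow.jpg")]
print(is_image_first(good), is_image_first(reversed_order))
```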

Prompt task syntax

The PaliGemma models are trained with specific prompt patterns and syntax for tasks such as object identification and image captioning. You can use this prompt task syntax to request specific behavior from the PaliGemma models, as follows:

  • "cap {lang}\n": Very raw short caption (from WebLI-alt)
  • "caption {lang}\n": Nice, COCO-like short captions
  • "describe {lang}\n": Somewhat longer, more descriptive captions
  • "ocr": Optical character recognition
  • "answer {lang} {question}\n": Question answering about the image contents
  • "question {lang} {answer}\n": Question generation for a given answer
  • "detect {object} ; {object}\n": Locate listed objects in an image and return the bounding boxes for those objects
  • "segment {object}\n": Locate the area occupied by the object in an image to create an image segmentation for that object

The {lang} placeholder takes a language code, such as en. PaliGemma supports 34 languages for task prompts that use this option. You can find the list of supported languages on GitHub.
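The patterns above are plain strings, so they can be assembled with ordinary string formatting. The following is a minimal sketch; the helper names (`task_prompt`, `caption_prompt`, `answer_prompt`, `detect_prompt`, `segment_prompt`) are hypothetical and only illustrate how the listed patterns fit together.

```python
def task_prompt(task, *args):
    """Join a task keyword and its arguments, ending with a newline."""
    return " ".join([task, *args]) + "\n"


def caption_prompt(lang="en", style="caption"):
    # style is one of the three captioning keywords from the list above.
    if style not in ("cap", "caption", "describe"):
        raise ValueError(f"unknown caption style: {style!r}")
    return task_prompt(style, lang)


def answer_prompt(question, lang="en"):
    return task_prompt("answer", lang, question)


def detect_prompt(*objects):
    # Multiple objects are separated by " ; ", as in the detect pattern.
    return task_prompt("detect", " ; ".join(objects))


def segment_prompt(obj):
    return task_prompt("segment", obj)


print(repr(caption_prompt("en")))
print(repr(answer_prompt("what color is the cow?")))
print(repr(detect_prompt("cow", "tree")))
```

Centralizing the trailing newline in one helper makes it harder to forget in the patterns that require it.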

For detailed code examples showing how to use this syntax, see the Generate PaliGemma output with Keras tutorial.

Batched prompt commands

You can send several prompt commands to the model together as a batch of instructions. Each prompt command must end with a \n character. The following example passes a list of prompts, each paired with an image, in a single call.

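# Assumes `paligemma` is an already loaded PaliGemma model and `cow_image`
# is an already loaded image, for example as set up in the Keras tutorial.
prompts = [
    'answer en where is the cow standing?\n',
    'answer en what color is the cow?\n',
    'describe en\n',
    'detect cow\n',
    'segment cow\n',
]
# Provide one image per prompt; here the same image is reused for each.
images = [cow_image] * len(prompts)
outputs = paligemma.generate(
    inputs={
        "images": images,
        "prompts": prompts,
    }
)
# Each output corresponds to the prompt at the same position.
for output in outputs:
    print(output)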
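Because every prompt command must end with \n and the example supplies one image per prompt, it can help to build the batch with a small checker. The sketch below is one way to do that; `build_batch` is a hypothetical helper, and the string used for `cow_image` is a placeholder standing in for real image data.

```python
def build_batch(image, prompts):
    """Pair one image with each prompt and check newline termination."""
    for prompt in prompts:
        if not prompt.endswith("\n"):
            raise ValueError(f"prompt must end with a newline: {prompt!r}")
    # Same dictionary layout as the inputs argument in the example above.
    return {"images": [image] * len(prompts), "prompts": list(prompts)}


# Placeholder standing in for real image data in this sketch.
cow_image = "cow_image_placeholder"
batch = build_batch(cow_image, ["describe en\n", "detect cow\n"])
print(len(batch["images"]), len(batch["prompts"]))
```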

System instructions

The PaliGemma models do not support any additional system instructions beyond the Gemma system instructions from the foundation models they are based on.