Gemma release 3 and later models can understand and process information from
both images and text. This capability enables them to perform complex tasks that
require a comprehensive understanding of the world.
Specifically, this section explores how you can use visual data in
prompts. By using Gemma to interpret and respond to images, videos, and other
visual inputs, you can unlock powerful new applications, including:
Image Interpretation: Gemma can be instructed to
analyze and understand the content of images.
Content Creation: Incorporating visual data into prompts
allows Gemma to produce more creative and contextually appropriate content.
Visual data
Visual data can come in many formats and levels of resolution. The image
formats you can use with Gemma, such as JPEG and PNG, are
determined by the framework you choose to convert visual data into tensors. Here
are some specific considerations for preparing visual data for processing with
Gemma:
Token cost: Each image typically uses 256 tokens. PaliGemma image token
costs vary depending on the model you select.
Resolution: The interpreted resolution for images, meaning the number of
pixels encoded into tokens and interpreted by the model, depends on the
Gemma version you are using:
Gemma 3: (4B and higher) 896x896 resolution, with pan and scan
options for larger images.
Gemma 3n: 256x256, 512x512, or 768x768 resolution
PaliGemma 2: 224x224, 448x448, or 896x896 resolution
Lower-resolution images are typically processed faster, at the cost of
fewer interpretable visual details. To optimize processing speed for
image data, provide visual data at one of the interpreted
resolution sizes of the Gemma model you are using.
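As a minimal sketch of this preparation step, you could pre-resize an image to one of the interpreted resolutions listed above before handing it to your framework. This example assumes the Pillow library is available; the resolution constants mirror the table above:

```python
from PIL import Image

# Interpreted resolutions from the list above.
GEMMA3_RES = (896, 896)                              # Gemma 3, 4B and higher
GEMMA3N_RES = [(256, 256), (512, 512), (768, 768)]   # Gemma 3n options

def prepare_for_gemma(path_or_image, target=GEMMA3_RES):
    """Resize an image to a Gemma interpreted resolution.

    Matching the target resolution up front avoids extra resampling
    work when the framework converts the image into tensors.
    """
    img = (path_or_image if isinstance(path_or_image, Image.Image)
           else Image.open(path_or_image))
    # Convert to RGB so grayscale or RGBA inputs are handled uniformly.
    return img.convert("RGB").resize(target)
```

For larger source images processed with Gemma 3, pan-and-scan options in your framework may give better results than a single resize; this helper only covers the simple case.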
Do's
Here are some best practices to follow when prompting Gemma with visual data.
Be specific: For specific tasks, provide sufficient context
and guidance. Instead of "describe this image", try "describe the scene in this
image, focusing on the relationship between the people and the objects."
Provide constraints: To achieve a particular style or tone, be sure to
specify it in your prompt. For example, instead of a general story request, ask
Gemma to "Write a short story about this image in the style of a film noir."
Iterative Refinement: Getting the intended output often requires
experimenting with and refining your prompts. Begin with a basic prompt and
gradually add complexity.
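The practices above can be sketched as a small prompt-building helper. The message structure shown here is hypothetical: it follows the common chat-style convention of interleaving image and text parts, but your framework's exact schema may differ:

```python
def build_visual_prompt(image_ref, task, focus=None, style=None):
    """Assemble a specific, constrained prompt for an image.

    image_ref: path or handle your framework accepts for the image.
    task: the concrete task, e.g. "describe the scene in this image".
    focus: optional detail to emphasize (be specific).
    style: optional style or tone constraint (provide constraints).
    """
    text = task
    if focus:
        text += f", focusing on {focus}"
    if style:
        text += f", in the style of {style}"
    return [
        {"type": "image", "image": image_ref},
        {"type": "text", "text": text + "."},
    ]

# Start with a basic prompt, then iteratively refine it by adding
# focus and style constraints.
prompt = build_visual_prompt(
    "scene.jpg",  # hypothetical image path
    "describe the scene in this image",
    focus="the relationship between the people and the objects",
)
```

Keeping the task, focus, and style as separate arguments makes it easy to add constraints one at a time while you iterate on the prompt.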
Don'ts
Here are some things to avoid when prompting Gemma with visual data.
Expect Pixel-Perfect Precision from Gemma: Tasks requiring precise
pixel-level analysis, such as detailed object detection and OCR, are best
handled by dedicated computer vision models. Gemma, for example, cannot
accurately count individual blades of grass in an image; it can only provide an
approximation.
Vague or Ambiguous Prompts: Instead of general prompts like "Generate
something based on this image", provide specific instructions to achieve the
intended output. Clearly define what "something" is: for example, a poem,
recipe, or code snippet.
Ignore Model Limitations: Understanding Gemma's limitations is vital for
effective use. Asking it to "Analyze this X-ray image and tell me the patient's
exact medical condition" is a clear example of misuse, potentially leading to
harmful medical misinformation.