The Gemini API is able to process images and videos, enabling a multitude of exciting developer use cases. Some of Gemini's vision capabilities include the ability to:
- Caption and answer questions about images
- Transcribe and reason over PDFs, including long documents up to 2 million tokens
- Describe, segment, and extract information from videos, including both visual frames and audio, up to 90 minutes long
- Detect objects in an image and return bounding box coordinates for them
This tutorial demonstrates some possible ways to prompt the Gemini API with images and video input, provides code examples, and outlines prompting best practices with multimodal vision capabilities. All output is text-only.
Before you begin: Set up your project and API key
Before calling the Gemini API, you need to set up your project and configure your API key.
Get and secure your API key
You need an API key to call the Gemini API. If you don't already have one, create a key in Google AI Studio.
It's strongly recommended that you do not check an API key into your version control system.
This tutorial assumes that you're accessing your API key as an environment variable.
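For example, on macOS or Linux you can export the key in your shell session so that the code samples below can read it (the samples use the variable name GOOGLE_API_KEY):
export GOOGLE_API_KEY=<YOUR_API_KEY>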
Prompting with images
In this tutorial, you will upload images using the File API or as inline data and generate content based on those images.
Technical details (images)
Gemini 1.5 Pro and 1.5 Flash support a maximum of 3,600 image files.
Images must be in one of the following image data MIME types:
- PNG - image/png
- JPEG - image/jpeg
- WEBP - image/webp
- HEIC - image/heic
- HEIF - image/heif
Each image is equivalent to 258 tokens.
There is no specific limit to the number of pixels in an image beyond the model's context window. However, larger images are scaled down to a maximum resolution of 3072x3072 while preserving their original aspect ratio, and smaller images are scaled up to 768x768 pixels. Aside from bandwidth, there is no cost reduction for smaller images, and no performance improvement for higher-resolution ones.
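Because each image costs a fixed 258 tokens regardless of its resolution after scaling, budgeting the image portion of a prompt is simple multiplication. A quick sanity check in the shell (the image count here is illustrative):
# 258 tokens per image, so 4 images cost:
NUM_IMAGES=4
echo "$(( NUM_IMAGES * 258 )) tokens for image parts"  # prints: 1032 tokens for image parts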
For best results:
- Rotate images to the correct orientation before uploading.
- Avoid blurry images.
- If using a single image, place the text prompt after the image.
Image input
When the total image payload size is less than 20 MB, we recommend either uploading Base64-encoded images or directly uploading locally stored image files.
Base64 encoded images
You can send images as Base64-encoded payloads. If an image is only available at a public URL, fetch it first (for example, with curl) and then encode it. The following code example shows how to encode a locally stored image and include it in a request:
IMG_PATH=/path/to/your/image1.jpeg
if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
B64FLAGS="--input"
else
B64FLAGS="-w0"
fi
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"text": "Caption this image."},
{
"inline_data": {
"mime_type":"image/jpeg",
"data": "'\$(base64 \$B64FLAGS \$IMG_PATH)'"
}
}
]
}]
}' 2> /dev/null
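The response is a JSON object; to print only the generated text, redirect the output to a file and filter it with jq ".candidates[].content.parts[].text", as the File API examples later in this guide do.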
Multiple images
To prompt with multiple images in Base64 encoded format, you can do the following:
IMAGE_PATH_1=/path/to/your/image1.jpeg
IMAGE_PATH_2=/path/to/your/image2.jpeg
if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
B64FLAGS="--input"
else
B64FLAGS="-w0"
fi
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{
"inline_data": {
"mime_type": "image/jpeg",
"data": "'\$(base64 $B64FLAGS "$IMAGE_PATH_1")'"
}
},
{
"inline_data": {
"mime_type": "image/png",
"data": "'\$(base64 $B64FLAGS "$IMAGE_PATH_2")'"
}
},
{
"text": "Generate a list of all the objects contained in both images."
}
]
}]
}' 2> /dev/null
Upload an image and generate content
When the combination of files and system instructions that you intend to send is larger than 20 MB, use the File API to upload those files.
Use the media.upload method of the File API to upload an image of any size. After uploading the file, you can make GenerateContent requests that reference the File API URI. Select the generative model and provide it with a text prompt and the uploaded image.
IMG_PATH_2=/path/to/your/image2.jpeg
BASE_URL="https://generativelanguage.googleapis.com"
MIME_TYPE=$(file -b --mime-type "${IMG_PATH_2}")
NUM_BYTES=$(wc -c < "${IMG_PATH_2}")
DISPLAY_NAME=TEXT
tmp_header_file=upload-header.tmp
# Initial resumable request defining metadata.
# The upload URL is in the response headers; dump them to a file.
curl "${BASE_URL}/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
-D upload-header.tmp \
-H "X-Goog-Upload-Protocol: resumable" \
-H "X-Goog-Upload-Command: start" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${DISPLAY_NAME}'}}" 2> /dev/null
upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# Upload the actual bytes.
curl "${upload_url}" \
-H "Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Offset: 0" \
-H "X-Goog-Upload-Command: upload, finalize" \
--data-binary "@${IMG_PATH_2}" 2> /dev/null > file_info.json
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
# Now generate content using that file
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"text": "Can you tell me about the instruments in this photo?"},
{"file_data":
{"mime_type": "image/jpeg",
"file_uri": '$file_uri'}
}]
}]
}' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json
Verify image file upload and get metadata
You can verify that the API successfully stored the uploaded file and get its metadata by calling files.get. Only the name (and by extension, the uri) are unique.
name=$(jq ".file.name" file_info.json)
# Get the file of interest to check state
curl https://generativelanguage.googleapis.com/v1beta/files/$name > file_info.json
# Print some information about the file you got
name=$(jq ".file.name" file_info.json)
echo name=$name
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
OpenAI Compatibility
You can access Gemini's image understanding capabilities using the OpenAI libraries. This lets you integrate Gemini into existing OpenAI workflows by updating three lines of code and using your Gemini API key. See the Image understanding example for code demonstrating how to send images encoded as Base64 payloads.
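As a minimal sketch, assuming the OpenAI-compatible REST endpoint (https://generativelanguage.googleapis.com/v1beta/openai/) and the standard OpenAI chat completions payload, a Base64 image request looks roughly like the following, with B64FLAGS and IMG_PATH defined as in the earlier examples; see the linked Image understanding example for the authoritative code:
# Sketch only: OpenAI-compatible endpoint with a data: URI image part.
curl "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $GOOGLE_API_KEY" \
    -d '{
      "model": "gemini-1.5-flash",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "Caption this image."},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'$(base64 $B64FLAGS "$IMG_PATH")'"}}
        ]
      }]
    }' 2> /dev/null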
Capabilities
This section outlines specific vision capabilities of the Gemini model, including object detection and bounding box coordinates.
Get a bounding box for an object
Gemini models are trained to return bounding box coordinates as relative widths or heights in the range of [0, 1]. These values are then scaled by 1000 and converted to integers. Effectively, the coordinates represent the bounding box on a 1000x1000 pixel version of the image. Therefore, you'll need to convert these coordinates back to the dimensions of your original image to accurately map the bounding boxes.
IMG_PATH=/path/to/your/image1.jpeg
if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
B64FLAGS="--input"
else
B64FLAGS="-w0"
fi
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d "{
\"contents\": [{
\"parts\":[
{
\"inline_data\": {
\"mime_type\": \"image/jpeg\",
\"data\": \"'\$(base64 $B64FLAGS \"$IMG_PATH\")'\"
}
},
{
\"text\": \"Generate bounding boxes for each of the objects in this image in [y_min, x_min, y_max, x_max] format.\"
}
]
}]
}" 2> /dev/null
The model returns bounding box coordinates in the format [ymin, xmin, ymax, xmax]. To convert these normalized coordinates to the pixel coordinates of your original image, follow these steps:
- Divide each output coordinate by 1000.
- Multiply the x-coordinates by the original image width.
- Multiply the y-coordinates by the original image height.
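As a minimal sketch of that conversion in the shell (the box values and image dimensions below are illustrative, not actual model output):
# Hypothetical model output on the 0-1000 scale: [y_min, x_min, y_max, x_max]
read Y_MIN X_MIN Y_MAX X_MAX <<< "200 150 800 900"
# Dimensions of the original image, in pixels (illustrative values).
WIDTH=1280
HEIGHT=960
# Dividing by 1000 and multiplying by the original dimension in one step:
echo "x_min_px=$(( X_MIN * WIDTH / 1000 )) y_min_px=$(( Y_MIN * HEIGHT / 1000 ))"
echo "x_max_px=$(( X_MAX * WIDTH / 1000 )) y_max_px=$(( Y_MAX * HEIGHT / 1000 ))"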
Prompting with video
In this tutorial, you will upload a video using the File API and generate content based on that video.
Technical details (video)
Gemini 1.5 Pro and Flash support up to approximately an hour of video data.
Video must be in one of the following video format MIME types:
- video/mp4
- video/mpeg
- video/mov
- video/avi
- video/x-flv
- video/mpg
- video/webm
- video/wmv
- video/3gpp
The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1 Kbps, single channel, adding timestamps every second. These rates are subject to change in the future as inference improves.
Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video.
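As a back-of-the-envelope check of that budget (the video length here is illustrative):
# ~300 tokens per second of video: 258 for the frame + 32 for audio + metadata.
VIDEO_SECONDS=$(( 50 * 60 ))  # a 50-minute video
echo "$(( VIDEO_SECONDS * 300 )) tokens"  # prints: 900000 tokens -- just under a 1M window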
To ask questions about time-stamped locations, use the format MM:SS, where the first two digits represent minutes and the last two digits represent seconds. For example, 01:15 refers to one minute and fifteen seconds into the video.
For best results:
- Use one video per prompt.
- If using a single video, place the text prompt after the video.
Upload a video and generate content
VIDEO_PATH=/path/to/your/video.mp4
BASE_URL="https://generativelanguage.googleapis.com"
MIME_TYPE=$(file -b --mime-type "${VIDEO_PATH}")
NUM_BYTES=$(wc -c < "${VIDEO_PATH}")
DISPLAY_NAME=VIDEO
tmp_header_file=upload-header.tmp
# Initial resumable request defining metadata.
# The upload URL is in the response headers; dump them to a file.
curl "${BASE_URL}/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
-D upload-header.tmp \
-H "X-Goog-Upload-Protocol: resumable" \
-H "X-Goog-Upload-Command: start" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${DISPLAY_NAME}'}}" 2> /dev/null
upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# Upload the actual bytes.
curl "${upload_url}" \
-H "Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Offset: 0" \
-H "X-Goog-Upload-Command: upload, finalize" \
--data-binary "@${VIDEO_PATH}" 2> /dev/null > file_info.json
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
name=$(jq -r ".file.name" file_info.json)
state=$(jq -r ".file.state" file_info.json)
echo state=$state
# Poll until the state of the video is ACTIVE.
while [[ "$state" == *"PROCESSING"* ]];
do
  echo "Processing video..."
  sleep 5
  # Get the file of interest to check state.
  curl "https://generativelanguage.googleapis.com/v1beta/${name}?key=${GOOGLE_API_KEY}" > file_info.json
  state=$(jq -r ".state" file_info.json)
done
# Now generate content using that file
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"text": "Describe this video clip"},
{"file_data":{"mime_type": "video/mp4", "file_uri": '$file_uri'}}]
}]
}' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json
List files
You can list all files uploaded using the File API, along with their URIs, using files.list.
echo "My files: "
curl "https://generativelanguage.googleapis.com/v1beta/files?key=$GOOGLE_API_KEY"
Delete files
Files uploaded using the File API are automatically deleted after 2 days. You can also manually delete them using files.delete.
curl --request "DELETE" https://generativelanguage.googleapis.com/v1beta/files/$name?key=$GOOGLE_API_KEY
What's next
This guide shows how to upload image and video files using the File API and then generate text outputs from image and video inputs. To learn more, see the following resources:
- File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
- System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
- Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.