Prompting with media files



The Gemini API supports multimodal prompting: you can include text, image, audio, and video data in your prompts. For small files, you can point the Gemini model directly to a local file when providing a prompt. Upload larger files with the File API before including them in prompts.

The File API lets you store up to 20GB of files per project, with each file not exceeding 2GB in size. Files are stored for 48 hours. During that time, you can use your API key to access them for generation, but you cannot download them from the API. The File API is available at no cost in all regions where the Gemini API is available.
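The storage limits above can be checked locally before an upload is attempted. The following is a minimal sketch, not part of the SDK: the helper name `fits_file_api_limits` is hypothetical, and treating 1 GB as 10**9 bytes is an assumption, since this guide does not say which definition applies.

```python
# Limits quoted in this guide. Treating 1 GB as 10**9 bytes is an assumption.
MAX_FILE_BYTES = 2 * 10**9       # 2 GB per file
MAX_PROJECT_BYTES = 20 * 10**9   # 20 GB per project


def fits_file_api_limits(file_bytes: int, project_bytes_used: int = 0) -> bool:
    """Return True if a file of `file_bytes` stays within both limits."""
    if file_bytes > MAX_FILE_BYTES:
        return False
    return project_bytes_used + file_bytes <= MAX_PROJECT_BYTES
```

For a local file, the size could come from `os.path.getsize("image.jpg")`. The amount already stored in the project has to be tracked by the caller.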

The File API handles inputs that can be used to generate content with model.generateContent or model.streamGenerateContent. For information on valid file formats (MIME types) and supported models, see Supported file formats.

This guide shows how to use the File API to upload media files and include them in a GenerateContent call to the Gemini API. For more information, see the code samples.

Before you begin: Set up your project and API key

Before calling the Gemini API (or its File API), you need to set up your project and configure your API key.

Prompting with images

In this tutorial, you upload a sample image using the File API and then use it to generate content.

Upload an image file

Refer to the Appendix section to learn how to upload your own file.

  1. Prepare a sample image to upload:

      curl -o image.jpg https://storage.googleapis.com/generativeai-downloads/images/jetpack.jpg

  2. Upload that file using media.upload so that you can access it with other API calls:

      # Assumes the API key was configured as described in "Before you begin".
      import google.generativeai as genai

      sample_file = genai.upload_file(path="image.jpg",
                                      display_name="Sample drawing")

      print(f"Uploaded file '{sample_file.display_name}' as: {sample_file.uri}")
    

The response shows that the uploaded image is stored with the specified display_name and has a uri to reference the file in Gemini API calls. Use the response to track how uploaded files are mapped to URIs.

Depending on your use case, you can store the URIs in structures, such as a dict or a database.
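As one possible way to keep track of uploads, the sketch below stores URIs in a plain dict keyed by display name. The names `uploaded_uris`, `remember_upload`, and `lookup_uri` are hypothetical, and the URI is a placeholder rather than a real File API value.

```python
from typing import Dict, Optional

# Hypothetical in-memory registry: display name -> File API URI.
uploaded_uris: Dict[str, str] = {}


def remember_upload(display_name: str, uri: str) -> None:
    """Record the URI returned for an uploaded file."""
    uploaded_uris[display_name] = uri


def lookup_uri(display_name: str) -> Optional[str]:
    """Return the stored URI, or None if nothing was recorded."""
    return uploaded_uris.get(display_name)


# Placeholder URI for illustration only.
remember_upload("Sample drawing", "https://example.invalid/files/abc123")
```

Because a dict holds one value per key, uploading a second file with the same display name overwrites the first entry. A database table would play the same role when the mapping needs to outlive the process.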

Get the image file's metadata

After uploading the file, you can verify the API successfully stored the file by calling files.get through the SDK.

This method lets you get the metadata for an uploaded file associated with the Google Cloud project linked to your API key. Only the name (and by extension, the uri) is unique. Use the display_name to identify files only if you manage uniqueness yourself.

file = genai.get_file(name=sample_file.name)
print(f"Retrieved file '{file.display_name}' as: {file.uri}")

Generate content using the uploaded image file

After uploading the image, you can make GenerateContent requests that reference the uri in the response (from either uploading the file or directly getting the metadata of the file).

In this example, you create a prompt that starts with text followed by the URI reference for the uploaded file:

from IPython.display import Markdown  # Renders the response in a notebook.

# The Gemini 1.5 models are versatile and work with multimodal prompts
model = genai.GenerativeModel(model_name="models/gemini-1.5-flash")

response = model.generate_content(["Describe the image with a creative description.", sample_file])

Markdown(">" + response.text)

Delete the image file

Files are automatically deleted after 48 hours. You can also manually delete them using files.delete through the SDK.

genai.delete_file(sample_file.name)
print(f'Deleted {sample_file.display_name}.')

Prompting with videos

In this tutorial, you upload a sample video using the File API and then use it to generate content.

Upload a video file

The Gemini API accepts video file formats directly. This example uses the short film "Big Buck Bunny".

"Big Buck Bunny" is (c) copyright 2008, Blender Foundation / www.bigbuckbunny.org and licensed under the Creative Commons Attribution 3.0 License.

Refer to the Appendix section to learn how to upload your own file.

!wget https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4

Upload that file using media.upload so that you can access it with other API calls:

video_file_name = "BigBuckBunny_320x180.mp4"

print("Uploading file...")
video_file = genai.upload_file(path=video_file_name)
print(f"Completed upload: {video_file.uri}")

NOTE: The File API samples the video at 1 frame per second (FPS). This sampling rate may be subject to change to provide the best inference quality.

Get the video file's metadata

Verify the API has successfully received the file by calling the files.get method through the SDK.

Video files have a State field in the File API. When a video is uploaded, it will be in PROCESSING state until it is ready for inference. Only ACTIVE files can be used for model inference.

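import time

# Poll until the File API finishes processing the video.
while video_file.state.name == "PROCESSING":
    print('.', end='')
    time.sleep(10)
    video_file = genai.get_file(video_file.name)

# Only ACTIVE files can be used for inference, so stop if processing failed.
if video_file.state.name == "FAILED":
    raise ValueError(video_file.state.name)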
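The loop above waits indefinitely if a file never leaves the PROCESSING state. As a hedged alternative, the sketch below adds a deadline. The helper `wait_until_active` and its parameters are hypothetical and not part of the SDK; it takes a callable that returns the current state name, so it does not depend on any particular client.

```python
import time
from typing import Callable


def wait_until_active(get_state: Callable[[], str],
                      poll_seconds: float = 10.0,
                      timeout_seconds: float = 600.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll `get_state` until it stops returning "PROCESSING".

    Raises TimeoutError if the deadline passes first, and ValueError if the
    final state is "FAILED". Otherwise returns the final state name.
    """
    deadline = time.monotonic() + timeout_seconds
    state = get_state()
    while state == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError("file is still PROCESSING after the timeout")
        sleep(poll_seconds)
        state = get_state()
    if state == "FAILED":
        raise ValueError("file processing FAILED")
    return state
```

With the SDK calls used in this guide, it could be invoked as `wait_until_active(lambda: genai.get_file(video_file.name).state.name)`. The `sleep` parameter exists only so the wait can be replaced in tests.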

Generate content using the uploaded video file

After uploading the video, you can make GenerateContent requests that reference the uri in the response (from either uploading the file or directly getting the metadata of the file).

# Create the prompt.
prompt = "Describe this video."

# The Gemini 1.5 models are versatile and work with multimodal prompts
model = genai.GenerativeModel(model_name="models/gemini-1.5-flash")

# Make the LLM request.
print("Making LLM inference request...")
response = model.generate_content([prompt, video_file],
                                  request_options={"timeout": 600})
print(response.text)

Delete the video file

Files are automatically deleted after 48 hours. You can also manually delete them using files.delete through the SDK.

genai.delete_file(video_file.name)
print(f'Deleted file {video_file.uri}')

Supported file formats

Gemini models support prompting with multiple file formats. This section explains considerations in using general media formats for prompting, specifically image, audio, video, and plain text files. You can use media files for prompting only with specific model versions, as shown in the following table.

Model Images Audio Video Plain text
Gemini 1.5 Pro (release 008 and later) ✔ (3600 max image files)
Gemini Pro Vision ✔ (16 max image files)

Image formats

You can use image data for prompting with a Gemini 1.5 model or the Gemini 1.0 Pro Vision model. When you use images for prompting, they are subject to the following limitations and requirements:

  • Images must be in one of the following image data MIME types:
    • PNG - image/png
    • JPEG - image/jpeg
    • WEBP - image/webp
    • HEIC - image/heic
    • HEIF - image/heif
  • Maximum of 16 individual images for the Gemini 1.0 Pro Vision model and 3600 images for the Gemini 1.5 models.
  • No specific limits to the number of pixels in an image; however, larger images are scaled down to fit a maximum resolution of 3072 x 3072 while preserving their original aspect ratio.
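To illustrate the scaling rule in the last bullet, the sketch below computes the size an image would have after being fitted inside 3072 x 3072 with its aspect ratio preserved. The helper `scaled_size` is hypothetical, and rounding to the nearest pixel is an assumption, since this guide does not describe how the service rounds.

```python
from typing import Tuple

MAX_SIDE = 3072  # Maximum width and height quoted in this guide.


def scaled_size(width: int, height: int) -> Tuple[int, int]:
    """Return the size after fitting an image inside MAX_SIDE x MAX_SIDE."""
    # Never scale up: images that already fit keep their original size.
    scale = min(1.0, MAX_SIDE / width, MAX_SIDE / height)
    # Rounding to the nearest pixel is an assumption, not documented behavior.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 6144 x 3072 image is limited by its width, so both sides are halved.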

Audio formats

You can use audio data for prompting with the Gemini 1.5 models. When you use audio for prompting, it is subject to the following limitations and requirements:

  • Audio data is supported in the following common audio format MIME types:
    • WAV - audio/wav
    • MP3 - audio/mp3
    • AIFF - audio/aiff
    • AAC - audio/aac
    • OGG Vorbis - audio/ogg
    • FLAC - audio/flac
  • The maximum supported length of audio data in a single prompt is 9.5 hours.
  • Audio files are resampled down to a 16 Kbps data resolution, and multiple channels of audio are combined into a single channel.
  • There is no specific limit to the number of audio files in a single prompt; however, the total combined length of all audio files in a single prompt cannot exceed 9.5 hours.
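The combined-length rule above can be checked before a prompt is built. This is a minimal sketch with a hypothetical helper, `audio_fits_in_prompt`; it assumes the caller already knows each file's duration in seconds.

```python
from typing import Iterable

MAX_AUDIO_SECONDS = 9.5 * 60 * 60  # 9.5 hours, as stated above.


def audio_fits_in_prompt(durations_seconds: Iterable[float]) -> bool:
    """Return True if the combined audio length is within the 9.5-hour limit."""
    return sum(durations_seconds) <= MAX_AUDIO_SECONDS
```

The number of files does not matter here, only their total length.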

Video formats

You can use video data for prompting with the Gemini 1.5 models.

  • Video data is supported in the following common video format MIME types:

    • video/mp4
    • video/mpeg
    • video/mov
    • video/avi
    • video/x-flv
    • video/mpg
    • video/webm
    • video/wmv
    • video/3gpp
  • The File API service samples videos into images at 1 frame per second (FPS). This sampling rate may change to provide the best inference quality. Individual images take up 258 tokens regardless of resolution and quality.
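The two figures above (1 FPS sampling and 258 tokens per image) allow a rough estimate of how many tokens a video's frames consume. The sketch below is illustrative only: `estimate_frame_tokens` is a hypothetical helper, rounding a partial second up to a whole frame is an assumption, and the estimate covers sampled frames only, not the text prompt or anything else in the request.

```python
import math

FRAMES_PER_SECOND = 1    # Sampling rate quoted above; it may change.
TOKENS_PER_FRAME = 258   # Per-image token cost quoted above.


def estimate_frame_tokens(duration_seconds: float) -> int:
    """Estimate the tokens used by the sampled frames of a video."""
    # Rounding a partial second up to a whole frame is an assumption.
    frames = math.ceil(duration_seconds * FRAMES_PER_SECOND)
    return frames * TOKENS_PER_FRAME
```

Under these assumptions, a one-minute video yields 60 frames, or 60 x 258 = 15480 tokens.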

Plain text formats

The File API supports uploading plain text files with the following MIME types:

  • text/plain
  • text/html
  • text/css
  • text/javascript
  • application/x-javascript
  • text/x-typescript
  • application/x-typescript
  • text/csv
  • text/markdown
  • text/x-python
  • application/x-python-code
  • application/json
  • text/xml
  • application/rtf
  • text/rtf

For plain text files with a MIME type not on the list, you can try specifying one of the above MIME types manually.
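One way to apply the fallback described above is to guess a MIME type from the filename and substitute a listed type when the guess is missing or unsupported. The sketch below uses Python's standard `mimetypes` module. The helper `pick_text_mime_type` is hypothetical, it is intended only for files already known to contain text, and falling back to text/plain is an assumption about which listed type is the most reasonable default.

```python
import mimetypes

# MIME types listed above as supported for plain text uploads.
SUPPORTED_TEXT_MIME_TYPES = {
    "text/plain", "text/html", "text/css", "text/javascript",
    "application/x-javascript", "text/x-typescript",
    "application/x-typescript", "text/csv", "text/markdown",
    "text/x-python", "application/x-python-code", "application/json",
    "text/xml", "application/rtf", "text/rtf",
}


def pick_text_mime_type(filename: str, fallback: str = "text/plain") -> str:
    """Return a supported MIME type for a text file, or `fallback`."""
    guessed, _ = mimetypes.guess_type(filename)
    return guessed if guessed in SUPPORTED_TEXT_MIME_TYPES else fallback
```

The result could then be supplied when uploading, for example through the `mime_type` argument that `genai.upload_file` is understood to accept. Verify that argument against the SDK version you use.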

Appendix: Uploading files to Colab

This notebook uses the File API with files that were downloaded from the internet. If you're running this in Colab and want to use your own files, you first need to upload them to the Colab instance.

First, click Files on the left sidebar, then click the Upload button.

Next, you'll upload that file to the File API. In the form for the code cell below, enter the filename for the file you uploaded and provide an appropriate display name for the file, then run the cell.

my_filename = "gemini_logo.png" # @param {type:"string"}
my_file_display_name = "Gemini Logo" # @param {type:"string"}

my_file = genai.upload_file(path=my_filename,
                            display_name=my_file_display_name)
print(f"Uploaded file '{my_file.display_name}' as: {my_file.uri}")