Prompting with media files


The Gemini API supports multimodal prompting: you can include text, image, audio, and video data in your prompts. For small files, you can point the Gemini model directly to a local file when providing a prompt. Upload larger files with the File API before including them in prompts.

The File API lets you store up to 20GB of files per project, with each file not exceeding 2GB in size. Files are stored for 48 hours. During that period you can use your API key to reference them in generation requests, but you cannot download them from the API. The File API is available at no cost in all regions where the Gemini API is available.
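The quotas above can be checked locally before attempting an upload. The following is a minimal sketch, not part of the API: the function names are hypothetical, and it assumes "GB" means 10^9 bytes (the more conservative reading, since the text does not say whether decimal or binary gigabytes are meant).

```python
from datetime import datetime, timedelta, timezone

# Assumption: 1GB = 10**9 bytes. The guide does not specify decimal vs. binary.
MAX_FILE_BYTES = 2 * 10**9       # 2GB per file
MAX_PROJECT_BYTES = 20 * 10**9   # 20GB per project
RETENTION = timedelta(hours=48)  # files are stored for 48 hours


def fits_file_api_limits(file_size_bytes, project_bytes_used=0):
    """Return True if a file of this size fits both the per-file and per-project quota."""
    if file_size_bytes > MAX_FILE_BYTES:
        return False
    return project_bytes_used + file_size_bytes <= MAX_PROJECT_BYTES


def expected_expiry(uploaded_at):
    """Return the time after which an uploaded file is no longer stored."""
    return uploaded_at + RETENTION


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(fits_file_api_limits(500 * 10**6))  # a 500MB file in an empty project
    print(expected_expiry(now))
```

The project usage figure has to come from your own bookkeeping; this sketch does not query the service for it.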

The File API handles inputs that can be used to generate content with model.generateContent or model.streamGenerateContent. For information on valid file formats (MIME types) and supported models, see Supported file formats.

This guide shows how to use the File API to upload media files and include them in a GenerateContent call to the Gemini API. For more information, see the code samples.
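As a rough illustration of that flow, the sketch below assumes the google-generativeai Python SDK, where the call corresponding to model.generateContent is named generate_content. The function name describe_media_file and the GOOGLE_API_KEY environment variable are assumptions made for this example.

```python
import os


def describe_media_file(path, prompt, model_name="gemini-1.5-pro"):
    """Upload a local file with the File API, then reference it in a prompt.

    Hypothetical helper: requires the google-generativeai package and an
    API key in the GOOGLE_API_KEY environment variable (assumed name).
    """
    # Imported lazily so this module still loads when the SDK is not installed.
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

    # Step 1: upload the media file through the File API.
    uploaded = genai.upload_file(path=path)

    # Step 2: pass the uploaded file alongside the text prompt.
    model = genai.GenerativeModel(model_name)
    response = model.generate_content([prompt, uploaded])
    return response.text
```

A call might look like describe_media_file("sample.png", "Describe this image."). Some media types may need processing time after upload before they can be used, which this sketch does not handle.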

Supported file formats

Gemini models support prompting with multiple file formats. This section explains considerations for using general media formats for prompting, specifically image, audio, video, and plain text files. You can use media files for prompting only with specific model versions, as shown in the following table.

Model Images Audio Video Plain text
Gemini 1.5 Pro (release 008 and later) ✔ (3600 max image files)
Gemini Pro Vision ✔ (16 max image files)

Image formats

You can use image data for prompting with the gemini-pro-vision and gemini-1.5-pro models. When you use images for prompting, they are subject to the following limitations and requirements:

  • Images must be in one of the following image data MIME types:
    • PNG - image/png
    • JPEG - image/jpeg
    • WEBP - image/webp
    • HEIC - image/heic
    • HEIF - image/heif
  • Maximum of 16 individual images for gemini-pro-vision and 3600 images for gemini-1.5-pro
  • No specific limits to the number of pixels in an image; however, larger images are scaled down to fit a maximum resolution of 3072 x 3072 while preserving their original aspect ratio.
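To see what the downscaling rule means for a particular image, the dimensions can be worked out ahead of time. This is only a sketch of the arithmetic implied by the last bullet: the function name is hypothetical, and the rounding to whole pixels is an assumption, since the service's exact rounding is not documented here.

```python
MAX_SIDE = 3072  # maximum width and height, per the limits above


def scaled_size(width, height, max_side=MAX_SIDE):
    """Return (width, height) after fitting inside max_side x max_side.

    The aspect ratio is preserved. Images that already fit are unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_side / width, max_side / height)
    if scale == 1.0:
        return (width, height)
    # Assumption: round to the nearest pixel, with a floor of 1 pixel.
    return (max(1, round(width * scale)), max(1, round(height * scale)))


if __name__ == "__main__":
    print(scaled_size(6144, 3072))
    print(scaled_size(1000, 500))
```

For example, a 6144 x 3072 image is halved to 3072 x 1536, while a 1000 x 500 image is left as it is.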

Audio formats

You can use audio data for prompting with the gemini-1.5-pro model. When you use audio for prompting, it is subject to the following limitations and requirements:

  • Audio data is supported in the following common audio format MIME types:
    • WAV - audio/wav
    • MP3 - audio/mp3
    • AIFF - audio/aiff
    • AAC - audio/aac
    • OGG Vorbis - audio/ogg
    • FLAC - audio/flac
  • The maximum supported length of audio data in a single prompt is 9.5 hours.
  • Audio files are resampled down to a 16 Kbps data resolution, and multiple channels of audio are combined into a single channel.
  • There is no specific limit to the number of audio files in a single prompt; however, the total combined length of all audio files in a single prompt cannot exceed 9.5 hours.
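Because the 9.5-hour cap applies to the combined length of all audio in a prompt, it can help to total the durations before sending a request. The helper below is a hypothetical sketch that assumes you already know each file's duration in seconds.

```python
MAX_AUDIO_SECONDS = 9.5 * 3600  # 9.5 hours, the combined limit per prompt


def audio_budget(durations_seconds):
    """Return (within_limit, seconds_remaining) for a set of audio clips.

    seconds_remaining is negative when the combined length exceeds the limit.
    """
    total = sum(durations_seconds)
    return (total <= MAX_AUDIO_SECONDS, MAX_AUDIO_SECONDS - total)


if __name__ == "__main__":
    # Nine one-hour recordings fit; a tenth would not.
    print(audio_budget([3600] * 9))
    print(audio_budget([3600] * 10))
```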

Video formats

You can use video data for prompting with the gemini-1.5-pro model.

  • Video data is supported in the following common video format MIME types:
    • video/mp4
    • video/mpeg
    • video/mov
    • video/avi
    • video/x-flv
    • video/mpg
    • video/webm
    • video/wmv
    • video/3gpp
  • The File API service samples videos into images at 1 frame per second (FPS). This sampling rate may change to provide the best inference quality. Individual images take up 258 tokens regardless of resolution and quality.
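The sampling rate and per-image token cost give a simple way to estimate how many tokens a video's frames will consume. The sketch below is illustrative only: the function name is hypothetical, rounding a partial second up to a full frame is an assumption, and any tokens used by the video's audio track or by the text prompt are not counted.

```python
import math

TOKENS_PER_FRAME = 258   # per sampled image, regardless of resolution
FRAMES_PER_SECOND = 1    # current sampling rate; stated as subject to change


def estimate_video_frame_tokens(duration_seconds):
    """Estimate the tokens consumed by the sampled frames of one video."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Assumption: a trailing partial second still yields one sampled frame.
    frames = math.ceil(duration_seconds * FRAMES_PER_SECOND)
    return frames * TOKENS_PER_FRAME


if __name__ == "__main__":
    print(estimate_video_frame_tokens(60))  # one minute of video
```

Under these assumptions, one minute of video is 60 frames, or 60 x 258 = 15480 tokens.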

Plain text formats

The File API supports uploading plain text files with the following MIME types:

  • text/plain
  • text/html
  • text/css
  • text/javascript
  • application/x-javascript
  • text/x-typescript
  • application/x-typescript
  • text/csv
  • text/markdown
  • text/x-python
  • application/x-python-code
  • application/json
  • text/xml
  • application/rtf
  • text/rtf

For plain text files with a MIME type not on the list, you can try specifying one of the above MIME types manually.
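One way to apply this fallback is to guess the MIME type from the file name and substitute a listed type when the guess is not supported. The sketch below uses Python's standard mimetypes module. The function name and the choice of text/plain as the fallback are assumptions made for illustration.

```python
import mimetypes

# The plain text MIME types listed above.
SUPPORTED_TEXT_MIME_TYPES = {
    "text/plain", "text/html", "text/css", "text/javascript",
    "application/x-javascript", "text/x-typescript",
    "application/x-typescript", "text/csv", "text/markdown",
    "text/x-python", "application/x-python-code", "application/json",
    "text/xml", "application/rtf", "text/rtf",
}


def text_mime_type_for(path, fallback="text/plain"):
    """Return a supported MIME type for path, or the fallback if none matches."""
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed in SUPPORTED_TEXT_MIME_TYPES:
        return guessed
    # Unknown or unsupported type: specify a listed type manually instead.
    return fallback


if __name__ == "__main__":
    print(text_mime_type_for("notes.txt"))
    print(text_mime_type_for("build.unknownext"))
```

The guessed type depends on the platform's MIME database, so the same file name can map to different types on different systems.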