Explore audio capabilities with the Gemini API

Gemini can respond to prompts about audio. For example, Gemini can:

  • Describe, summarize, or answer questions about the audio content.
  • Provide a transcription of the audio.
  • Answer questions about, or transcribe, a specific segment of the audio.

This guide demonstrates different ways to:

  • Pass audio to a Gemini model.
  • Prompt the Gemini model about the audio.

Supported audio formats

Gemini supports the following audio format MIME types:

  • WAV - audio/wav
  • MP3 - audio/mp3
  • AIFF - audio/aiff
  • AAC - audio/aac
  • OGG Vorbis - audio/ogg
  • FLAC - audio/flac
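
As a rough illustration of how a client might pick one of the MIME types above before sending a file, the following sketch maps a file extension to its MIME type. The helper name `audio_mime_type` and the extension keys (including `.aif` as an alias for AIFF) are assumptions made for this example; only the MIME type strings come from the list above.

```python
from pathlib import Path

# MIME type values are taken from the supported-formats list.
# The file-extension keys are assumptions made for this sketch.
SUPPORTED_AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aif": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def audio_mime_type(filename: str) -> str:
    """Return the MIME type for a supported audio file (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()
    try:
        return SUPPORTED_AUDIO_MIME_TYPES[suffix]
    except KeyError:
        # Fail early instead of sending a format that is not in the list.
        raise ValueError(f"Unsupported audio file extension: {suffix!r}") from None
```

Rejecting unknown extensions up front keeps the error close to the caller rather than surfacing it later as a failed request.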

Technical details about audio

Gemini handles audio according to the following rules and limits:

  • Gemini represents each second of audio as 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
  • Gemini can only infer responses to English-language speech.
  • Gemini can "understand" non-speech components, such as birdsong or sirens.
  • The maximum supported length of audio data in a single prompt is 9.5 hours. Gemini doesn't limit the number of audio files in a single prompt; however, the total combined length of all audio files in a single prompt cannot exceed 9.5 hours.
  • Gemini downsamples audio files to a 16 Kbps data resolution.
  • If the audio source contains multiple channels, Gemini combines those channels down to a single channel.
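
The token rate and the length limit above lend themselves to a quick pre-flight check. The sketch below estimates the token count for a set of clips and rejects a prompt whose combined audio exceeds 9.5 hours. The function names are hypothetical, and rounding fractional seconds up is an assumption; the constants come from the list above.

```python
import math

TOKENS_PER_SECOND = 25               # from the list above
MAX_TOTAL_SECONDS = 9.5 * 60 * 60    # 9.5 hours = 34,200 seconds


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate tokens for a clip; rounding up is an assumption."""
    return math.ceil(duration_seconds * TOKENS_PER_SECOND)


def check_prompt_audio(durations_seconds: list[float]) -> int:
    """Validate the combined length of all clips and return estimated tokens."""
    total = sum(durations_seconds)
    if total > MAX_TOTAL_SECONDS:
        raise ValueError(
            f"Combined audio length {total:.0f}s exceeds the "
            f"{MAX_TOTAL_SECONDS:.0f}s (9.5 hour) limit"
        )
    return estimate_audio_tokens(total)


print(estimate_audio_tokens(60))         # one minute of audio -> 1500
print(check_prompt_audio([3600, 1800]))  # 1.5 hours total -> 135000
```

Note that the limit applies to the combined length, so the check sums all clips before comparing rather than testing each file on its own.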