Gemini can respond to prompts about audio. For example, Gemini can:
- Describe, summarize, or answer questions about the audio content.
- Provide a transcription of the audio.
- Answer questions about, or transcribe, a specific segment of the audio.
This guide demonstrates different ways to:
- Pass audio to a Gemini model.
- Prompt the Gemini model about the audio.
Supported audio formats
Gemini supports the following audio format MIME types:
- WAV - audio/wav
- MP3 - audio/mp3
- AIFF - audio/aiff
- AAC - audio/aac
- OGG Vorbis - audio/ogg
- FLAC - audio/flac
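As a minimal sketch of how the list above might be used, the following helper picks a MIME type from a file name before the audio is sent. The MIME strings come from the list; the extension-to-format mapping, the `.aif` alias, and the names `SUPPORTED_AUDIO_MIME_TYPES` and `audio_mime_type` are assumptions made for illustration and are not part of any Gemini SDK.

```python
from pathlib import Path

# MIME types copied from the list above. Keying them by file extension
# (including the ".aif" alias) is an illustrative assumption.
SUPPORTED_AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aif": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def audio_mime_type(path: str) -> str:
    """Return the MIME type for an audio file, based only on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED_AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported audio file extension: {suffix!r}") from None
```

The lookup inspects only the file name, not the file contents, so a mislabeled file would still pass this check.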
Technical details about audio
Gemini handles audio as follows:
- Gemini represents each second of audio as 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
- Gemini can only respond to speech in English.
- Gemini can "understand" non-speech components, such as birdsong or sirens.
- The maximum supported length of audio data in a single prompt is 9.5 hours. Gemini doesn't limit the number of audio files in a single prompt; however, the total combined length of all audio files in a single prompt cannot exceed 9.5 hours.
- Gemini downsamples audio files to a 16 Kbps data resolution.
- If the audio source contains multiple channels, Gemini combines those channels down to a single channel.
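The token rate and the length limit above lend themselves to a quick pre-flight check. The sketch below estimates the token count for a clip and tests whether a set of clips fits in one prompt. The constants are taken from the list above; the function names are hypothetical, and rounding fractional tokens up is an assumption, since the text does not say how partial seconds are counted.

```python
import math
from typing import Iterable

TOKENS_PER_SECOND = 25              # from the list above
MAX_TOTAL_SECONDS = 9.5 * 60 * 60   # 9.5 hours, expressed in seconds


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate the tokens used by audio of the given length.

    Rounding up is an assumption; the guide does not state how
    fractional seconds are counted.
    """
    return math.ceil(duration_seconds * TOKENS_PER_SECOND)


def fits_in_one_prompt(durations_seconds: Iterable[float]) -> bool:
    """Check that the combined length of all clips is within 9.5 hours."""
    return sum(durations_seconds) <= MAX_TOTAL_SECONDS
```

For example, `estimate_audio_tokens(60)` returns 1500, matching the one-minute figure above, and audio at the full 9.5-hour limit comes to 855,000 tokens.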