Explore audio capabilities with the Gemini API

Gemini can respond to prompts about audio. For example, Gemini can:

  • Describe, summarize, or answer questions about audio content.
  • Provide a transcription of the audio.
  • Answer questions about, or transcribe, a specific segment of the audio.

This guide demonstrates different ways to interact with audio files and audio content using the Gemini API.
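
As an illustration of prompting about a specific segment, the following minimal sketch builds a text prompt that names a time range. The MM:SS timestamp format, the prompt wording, and the helper names `format_timestamp` and `segment_transcript_prompt` are assumptions made for this example; they are not part of the Gemini API.

```python
def format_timestamp(total_seconds: int) -> str:
    """Format a non-negative offset in seconds as MM:SS (assumed format)."""
    if total_seconds < 0:
        raise ValueError("offset must be non-negative")
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


def segment_transcript_prompt(start_seconds: int, end_seconds: int) -> str:
    """Build a prompt asking for a transcript of one segment of the audio.

    Hypothetical helper: the wording is only one way to phrase the request.
    """
    if end_seconds <= start_seconds:
        raise ValueError("segment end must come after segment start")
    start = format_timestamp(start_seconds)
    end = format_timestamp(end_seconds)
    return f"Provide a transcript of the speech from {start} to {end}."


if __name__ == "__main__":
    print(segment_transcript_prompt(150, 209))
```

The resulting string would be sent alongside the audio data as the text part of the prompt.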

Supported audio formats

Gemini supports the following audio format MIME types:

  • WAV - audio/wav
  • MP3 - audio/mp3
  • AIFF - audio/aiff
  • AAC - audio/aac
  • OGG Vorbis - audio/ogg
  • FLAC - audio/flac
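
Before sending a file, it can help to check that its format is on the list above. The following sketch maps a file extension to the corresponding MIME type. The extension-to-format pairing and the helper name `audio_mime_type` are assumptions for illustration; the MIME strings themselves come from the list.

```python
from pathlib import Path

# MIME types from the supported-format list above.
# The file extensions used as keys are an assumption of this sketch.
SUPPORTED_AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def audio_mime_type(filename: str) -> str:
    """Return the MIME type for a supported audio file, based on its extension.

    Raises ValueError if the extension is not in the supported list.
    """
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio format: {filename!r}")
    return SUPPORTED_AUDIO_MIME_TYPES[suffix]


if __name__ == "__main__":
    print(audio_mime_type("interview.MP3"))
```

The lookup inspects only the file name, not the file contents, so a mislabeled file would still pass this check.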

Technical details about audio

The following details apply to how Gemini processes audio:

  • Gemini represents each second of audio as 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
  • Gemini can only generate responses for English-language speech.
  • Gemini can "understand" non-speech components, such as birdsong or sirens.
  • The maximum supported length of audio data in a single prompt is 9.5 hours. Gemini doesn't limit the number of audio files in a single prompt; however, the total combined length of all audio files in a single prompt cannot exceed 9.5 hours.
  • Gemini downsamples audio files to a 16 Kbps data resolution.
  • If the audio source contains multiple channels, Gemini combines those channels down to a single channel.
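
The token rate and length limit above lend themselves to a quick budget check before building a prompt. The following is a minimal sketch using only the figures stated in the list (25 tokens per second, 9.5 hours combined). The function names are hypothetical, and rounding fractional seconds up is an assumption; the source does not say how partial seconds are counted.

```python
import math

TOKENS_PER_SECOND = 25                 # from the list above
MAX_AUDIO_SECONDS = 9.5 * 60 * 60      # 9.5 hours = 34,200 seconds


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate the token count for audio of the given length.

    Rounds up to a whole token (an assumption of this sketch).
    """
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return math.ceil(duration_seconds * TOKENS_PER_SECOND)


def fits_in_one_prompt(durations_seconds: list[float]) -> bool:
    """Check that the combined length of all files is within the 9.5-hour limit."""
    return sum(durations_seconds) <= MAX_AUDIO_SECONDS


if __name__ == "__main__":
    print(estimate_audio_tokens(60))                    # one minute of audio
    print(fits_in_one_prompt([5 * 3600, 4.5 * 3600]))   # exactly 9.5 hours
```

For example, one minute of audio comes to 1,500 tokens, and audio at the full 9.5-hour limit comes to 34,200 × 25 = 855,000 tokens.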

What's next

This guide shows how to upload audio files using the File API and then generate text outputs from audio inputs. To learn more, see the following resources:

  • File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
  • System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
  • Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.