Gemini can respond to prompts about audio. For example, Gemini can:
- Describe, summarize, or answer questions about audio content.
- Provide a transcription of the audio.
- Provide answers or a transcription about a specific segment of the audio.
This guide demonstrates different ways to interact with audio files and audio content using the Gemini API.
Supported audio formats
Gemini supports the following audio format MIME types:
- WAV -
audio/wav
- MP3 -
audio/mp3
- AIFF -
audio/aiff
- AAC -
audio/aac
- OGG Vorbis -
audio/ogg
- FLAC -
audio/flac
Technical details about audio
Gemini imposes the following rules on audio:
- Gemini represents each second of audio as 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
- Gemini can only infer responses to English-language speech.
- Gemini can "understand" non-speech components, such as birdsong or sirens.
- The maximum supported length of audio data in a single prompt is 9.5 hours. Gemini doesn't limit the number of audio files in a single prompt; however, the total combined length of all audio files in a single prompt cannot exceed 9.5 hours.
- Gemini downsamples audio files to a 16 Kbps data resolution.
- If the audio source contains multiple channels, Gemini combines those channels down to a single channel.
Before you begin: Set up your project and API key
Before calling the Gemini API, you need to set up your project and configure your API key.
Get and secure your API key
You need an API key to call the Gemini API. If you don't already have one, create a key in Google AI Studio.
It's strongly recommended that you do not check an API key into your version control system.
This tutorial assumes that you're accessing your API key as an environment variable.
Make an audio file available to Gemini
You can make an audio file available to Gemini in either of the following ways:
- Upload the audio file prior to making the prompt request.
- Provide the audio file as inline data to the prompt request.
Upload an audio file and generate content
You can use the File API to upload an audio file of any size. Always use the File API when the total request size (including the files, text prompt, system instructions, etc.) is larger than 20 MB.
Call media.upload
to upload a file using the
File API. The following code uploads an audio file and then uses the file in a
call to
models.generateContent
.
MIME_TYPE=$(file -b --mime-type "${AUDIO_PATH}")
NUM_BYTES=$(wc -c < "${AUDIO_PATH}")
DISPLAY_NAME=AUDIO
tmp_header_file=upload-header.tmp
# Initial resumable request defining metadata.
# The upload url is in the response headers dump them to a file.
curl "${BASE_URL}/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
-D upload-header.tmp \
-H "X-Goog-Upload-Protocol: resumable" \
-H "X-Goog-Upload-Command: start" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${DISPLAY_NAME}'}}" 2> /dev/null
upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# Upload the actual bytes.
curl "${upload_url}" \
-H "Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Offset: 0" \
-H "X-Goog-Upload-Command: upload, finalize" \
--data-binary "@${AUDIO_PATH}" 2> /dev/null > file_info.json
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
# Now generate content using that file
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"text": "Describe this audio clip"},
{"file_data":{"mime_type": "audio/mp3", "file_uri": '$file_uri'}}]
}]
}' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json
Get metadata for a file
You can verify the API successfully stored the uploaded file and get its
metadata by calling files.get
.
name=$(jq ".file.name" file_info.json)
# Get the file of interest to check state
curl https://generativelanguage.googleapis.com/v1beta/files/$name > file_info.json
# Print some information about the file you got
name=$(jq ".file.name" file_info.json)
echo name=$name
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
List uploaded files
You can upload multiple audio files (and other kinds of files). The following code generates a list of all the files uploaded:
echo "My files: "
curl "https://generativelanguage.googleapis.com/v1beta/files?key=$GOOGLE_API_KEY"
Delete uploaded files
Files are automatically deleted after 48 hours. Optionally, you can manually delete an uploaded file. For example:
curl --request "DELETE" https://generativelanguage.googleapis.com/v1beta/files/$name?key=$GOOGLE_API_KEY
What's next
This guide shows how to upload audio files using the File API and then generate text outputs from audio inputs. To learn more, see the following resources:
- File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
- System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
- Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.