The Multimodal Live API enables low-latency bidirectional voice and video interactions with Gemini. Using the Multimodal Live API, you can provide end users with the experience of natural, human-like voice conversations, including the ability to interrupt the model's responses with voice commands. The model can process text, audio, and video input, and it can provide text and audio output.
You can try the Multimodal Live API in Google AI Studio.
Use the Multimodal Live API
This section describes how to use the Multimodal Live API with one of our SDKs. For more information about the underlying WebSockets API, see the WebSockets API reference below.
Send and receive text
import asyncio
from google import genai

client = genai.Client(api_key="GEMINI_API_KEY", http_options={'api_version': 'v1alpha'})
model = "gemini-2.0-flash-exp"
config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        while True:
            message = input("User> ")
            if message.lower() == "exit":
                break
            await session.send(input=message, end_of_turn=True)

            async for response in session.receive():
                if response.text is not None:
                    print(response.text, end="")

if __name__ == "__main__":
    asyncio.run(main())
Receive audio
The following example shows how to receive audio data and write it to a .wav file.
import asyncio
import wave
from google import genai

client = genai.Client(api_key="GEMINI_API_KEY", http_options={'api_version': 'v1alpha'})
model = "gemini-2.0-flash-exp"
config = {"response_modalities": ["AUDIO"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        wf = wave.open("audio.wav", "wb")
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)

        message = "Hello? Gemini are you there?"
        await session.send(input=message, end_of_turn=True)

        async for response in session.receive():
            if response.data is not None:
                wf.writeframes(response.data)

            # Un-comment this to print audio data info
            # if response.server_content.model_turn is not None:
            #     print(response.server_content.model_turn.parts[0].inline_data.mime_type)

        wf.close()

if __name__ == "__main__":
    asyncio.run(main())
Audio formats
The Multimodal Live API supports the following audio formats:
- Input audio format: raw 16-bit PCM audio at 16 kHz, little-endian
- Output audio format: raw 16-bit PCM audio at 24 kHz, little-endian
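As an illustration (this helper is not part of the API), the following sketch converts an existing WAV file into the expected input format using the standard library's wave and audioop modules. Note that audioop is deprecated in recent Python versions and removed in Python 3.13.

import wave
import audioop  # deprecated since Python 3.11, removed in 3.13

def to_live_input_pcm(path):
    """Convert a WAV file to raw 16-bit, 16 kHz, mono, little-endian PCM bytes."""
    with wave.open(path, "rb") as wf:
        pcm = wf.readframes(wf.getnframes())
        width = wf.getsampwidth()
        if wf.getnchannels() == 2:
            pcm = audioop.tomono(pcm, width, 0.5, 0.5)   # mix down to mono
        if width != 2:
            pcm = audioop.lin2lin(pcm, width, 2)         # force 16-bit samples
        if wf.getframerate() != 16000:
            pcm, _ = audioop.ratecv(pcm, 2, 1, wf.getframerate(), 16000, None)
    return pcm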
Stream audio and video
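The send side of a live audio stream can look roughly like the following sketch. It assumes the third-party pyaudio package and an already-open session from the earlier examples; the shape of the send call is an assumption based on this SDK release, not an official reference. In a real application, audio capture and response playback would run as separate concurrent tasks.

import pyaudio  # third-party dependency, assumed to be installed

CHUNK = 512    # frames per read
RATE = 16000   # input must be 16-bit PCM, 16 kHz, mono

async def stream_microphone(session):
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    try:
        while True:
            data = stream.read(CHUNK, exception_on_overflow=False)
            # Realtime input: end of turn is inferred from voice activity, not set explicitly.
            await session.send(input={"data": data, "mime_type": "audio/pcm"})
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()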
System instructions
System instructions let you steer the behavior of a model based on your specific needs and use cases. System instructions can be set in the setup configuration and will remain in effect for the entire session.
from google.genai import types

config = {
    "system_instruction": types.Content(
        parts=[
            types.Part(
                text="You are a helpful assistant and answer in a friendly tone."
            )
        ]
    ),
    "response_modalities": ["TEXT"],
}
Incremental content updates
Use incremental updates to send text input, establish session context, or restore session context. For short contexts you can send turn-by-turn interactions to represent the exact sequence of events:
Python
from google.genai import types

turns = [
    types.Content(parts=[types.Part(text="What is the capital of France?")], role="user"),
    types.Content(parts=[types.Part(text="Paris")], role="model")
]
await session.send(input=types.LiveClientContent(turns=turns))

turns = [types.Content(parts=[types.Part(text="What is the capital of Germany?")], role="user")]
await session.send(input=types.LiveClientContent(turns=turns, turn_complete=True))
JSON
{
  "clientContent": {
    "turns": [
      {
        "parts": [
          {
            "text": ""
          }
        ],
        "role": "user"
      },
      {
        "parts": [
          {
            "text": ""
          }
        ],
        "role": "model"
      }
    ],
    "turnComplete": true
  }
}
For longer contexts it's recommended to provide a single message summary to free up the context window for subsequent interactions.
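For example, a previous session could be summarized into a single user turn and sent without triggering generation (the summary text here is only a placeholder):

from google.genai import types

summary = "Summary of the previous session goes here."
turns = [types.Content(parts=[types.Part(text=summary)], role="user")]
# turn_complete=False appends context without asking the model to respond yet.
await session.send(input=types.LiveClientContent(turns=turns, turn_complete=False))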
Change voices
Multimodal Live API supports the following voices: Aoede, Charon, Fenrir, Kore, and Puck.
To specify a voice, set the voice name within the speechConfig object as part of the session configuration:
Python
from google.genai import types

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
        )
    )
)
JSON
{
  "voiceConfig": {
    "prebuiltVoiceConfig": {
      "voiceName": "Kore"
    }
  }
}
Use function calling
You can define tools with the Multimodal Live API. See the Function calling tutorial to learn more about function calling.
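For illustration, assume the tool is a plain Python function like the following (a hypothetical stand-in; the Function calling tutorial covers tool declarations in full):

def set_light_values(brightness: int, color_temp: str) -> dict:
    """Sets the brightness and color temperature of a room light (mock implementation)."""
    return {"brightness": brightness, "colorTemp": color_temp}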
Tools must be defined as part of the session configuration:
config = types.LiveConnectConfig(
    response_modalities=["TEXT"],
    tools=[set_light_values]
)

async with client.aio.live.connect(model=model, config=config) as session:
    await session.send(input="Turn the lights down to a romantic level", end_of_turn=True)

    async for response in session.receive():
        print(response.tool_call)
From a single prompt, the model can generate multiple function calls and the code necessary to chain their outputs. This code executes in a sandbox environment, generating subsequent BidiGenerateContentToolCall messages. The execution pauses until the results of each function call are available, which ensures sequential processing.
The client should respond with BidiGenerateContentToolResponse.
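A rough sketch of that round trip, reusing the set_light_values function from above, is shown below; the SDK type and attribute names are assumptions based on this release rather than an official reference:

from google.genai import types

async for response in session.receive():
    if response.tool_call is not None:
        function_responses = []
        for fc in response.tool_call.function_calls:
            result = set_light_values(**fc.args)  # run the requested function locally
            function_responses.append(
                types.FunctionResponse(id=fc.id, name=fc.name, response={"result": result})
            )
        await session.send(input=types.LiveClientToolResponse(
            function_responses=function_responses))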
Audio inputs and audio outputs negatively impact the model's ability to use function calling.
Handle interruptions
Users can interrupt the model's output at any time. When Voice activity detection (VAD) detects an interruption, the ongoing generation is canceled and discarded. Only the information already sent to the client is retained in the session history. The server then sends a BidiGenerateContentServerContent message to report the interruption.
In addition, the Gemini server discards any pending function calls and sends a BidiGenerateContentToolCallCancellation message with the IDs of the canceled calls.
async for response in session.receive():
    if response.server_content.interrupted is not None:
        # The generation was interrupted
        ...
Limitations
Consider the following limitations of Multimodal Live API and Gemini 2.0 when you plan your project.
Client authentication
Multimodal Live API only provides server-to-server authentication and isn't recommended for direct client use. Client input should be routed through an intermediate application server for secure authentication with the Multimodal Live API.
Conversation history
While the model keeps track of in-session interactions, conversation history isn't stored. When a session ends, the corresponding context is erased.
In order to restore a previous session or provide the model with historic context of user interactions, the application should maintain its own conversation log and use a BidiGenerateContentClientContent message to send this information at the start of a new session.
Maximum session duration
Session duration is limited to up to 15 minutes for audio or up to 2 minutes of audio and video. When the session duration exceeds the limit, the connection is terminated.
The model is also limited by the context size. Sending large chunks of content alongside the video and audio streams may result in earlier session termination.
Voice activity detection (VAD)
The model automatically performs voice activity detection (VAD) on a continuous audio input stream. VAD is always enabled, and its parameters aren't configurable.
Token count
Token count isn't supported.
Rate limits
The following rate limits apply:
- 3 concurrent sessions per API key
- 4M tokens per minute
WebSockets API reference
Multimodal Live API is a stateful API that uses WebSockets. In this section, you'll find additional details regarding the WebSockets API.
Sessions
A WebSocket connection establishes a session between the client and the Gemini server. After a client initiates a new connection, the session can exchange messages with the server to:
- Send text, audio, or video to the Gemini server.
- Receive audio, text, or function call requests from the Gemini server.
The initial message after connection sets the session configuration, which includes the model, generation parameters, system instructions, and tools.
See the following example configuration. Note that the name casing in SDKs may vary. See the Python SDK reference for the corresponding configuration options.
{
  "model": string,
  "generationConfig": {
    "candidateCount": integer,
    "maxOutputTokens": integer,
    "temperature": number,
    "topP": number,
    "topK": integer,
    "presencePenalty": number,
    "frequencyPenalty": number,
    "responseModalities": [string],
    "speechConfig": object
  },
  "systemInstruction": string,
  "tools": [object]
}
Send messages
To exchange messages, the client must send a JSON object over an open WebSocket connection. The JSON object must have exactly one of the fields from the following object set:
{
  "setup": BidiGenerateContentSetup,
  "clientContent": BidiGenerateContentClientContent,
  "realtimeInput": BidiGenerateContentRealtimeInput,
  "toolResponse": BidiGenerateContentToolResponse
}
Supported client messages
See the supported client messages in the following table:
Message | Description
---|---
BidiGenerateContentSetup | Session configuration to be sent in the first message
BidiGenerateContentClientContent | Incremental content update of the current conversation delivered from the client
BidiGenerateContentRealtimeInput | Real time audio or video input
BidiGenerateContentToolResponse | Response to a ToolCallMessage received from the server
Receive messages
To receive messages from Gemini, listen for the WebSocket 'message' event, and then parse the result according to the definition of the supported server messages.
See the following:
async with client.aio.live.connect(model='...', config=config) as session:
    await session.send(input='Hello world!', end_of_turn=True)
    async for message in session.receive():
        print(message)
Server messages will have exactly one of the fields from the following object set:
{
  "setupComplete": BidiGenerateContentSetupComplete,
  "serverContent": BidiGenerateContentServerContent,
  "toolCall": BidiGenerateContentToolCall,
  "toolCallCancellation": BidiGenerateContentToolCallCancellation
}
Supported server messages
See the supported server messages in the following table:
Message | Description
---|---
BidiGenerateContentSetupComplete | Sent in response to a BidiGenerateContentSetup message from the client when setup is complete
BidiGenerateContentServerContent | Content generated by the model in response to a client message
BidiGenerateContentToolCall | Request for the client to run the function calls and return the responses with the matching IDs
BidiGenerateContentToolCallCancellation | Sent when a function call is canceled due to the user interrupting model output
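With the Python SDK, each received message exposes these as optional attributes, so a handler can branch on whichever one is populated. A minimal sketch (the cancellation attribute name is assumed to follow the same snake_case pattern used elsewhere in this guide):

async for message in session.receive():
    if message.server_content is not None:
        ...  # model output: text/audio parts, turn_complete, interrupted
    elif message.tool_call is not None:
        ...  # execute the requested functions and send a tool response
    elif message.tool_call_cancellation is not None:
        ...  # discard results for the cancelled call ids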
Messages and events
BidiGenerateContentClientContent
Incremental update of the current conversation delivered from the client. All of the content here is unconditionally appended to the conversation history and used as part of the prompt to the model to generate content.
A message here will interrupt any current model generation.
Field | Description
---|---
turns[] | Optional. The content appended to the current conversation with the model. For single-turn queries, this is a single instance. For multi-turn queries, this is a repeated field that contains conversation history and the latest request.
turn_complete | Optional. If true, indicates that the server content generation should start with the currently accumulated prompt. Otherwise, the server awaits additional messages before starting generation.
BidiGenerateContentRealtimeInput
User input that is sent in real time.
This is different from BidiGenerateContentClientContent in a few ways:

- Can be sent continuously without interruption to model generation.
- If there is a need to mix data interleaved across the BidiGenerateContentClientContent and the BidiGenerateContentRealtimeInput, the server attempts to optimize for best response, but there are no guarantees.
- End of turn is not explicitly specified, but is rather derived from user activity (for example, end of speech).
- Even before the end of turn, the data is processed incrementally to optimize for a fast start of the response from the model.
- Is always assumed to be the user's input (cannot be used to populate conversation history). Can be sent continuously without interruptions. The model automatically detects the beginning and the end of user speech and starts or terminates streaming the response accordingly. Data is processed incrementally as it arrives, minimizing latency.
Field | Description
---|---
media_chunks[] | Optional. Inlined bytes data for media input.
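Over the raw WebSocket API, a single audio chunk could be sent like this (a sketch assuming the ws connection from the setup example above and some pcm_bytes in the input format described earlier; media data is base64-encoded in JSON, and the MIME type shown is an assumption):

import base64
import json

chunk = {
    "mimeType": "audio/pcm;rate=16000",  # assumed MIME type for raw 16 kHz PCM
    "data": base64.b64encode(pcm_bytes).decode("ascii"),
}
await ws.send(json.dumps({"realtimeInput": {"mediaChunks": [chunk]}}))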
BidiGenerateContentServerContent
Incremental server update generated by the model in response to client messages.
Content is generated as quickly as possible, and not in real time. Clients may choose to buffer and play it out in real time.
Field | Description
---|---
turn_complete | Output only. If true, indicates that the model is done generating. Generation will only start in response to additional client messages. Can be set alongside content, indicating that the content is the last in the turn.
interrupted | Output only. If true, indicates that a client message has interrupted current model generation. If the client is playing out the content in real time, this is a good signal to stop and empty the current playback queue.
grounding_metadata | Output only. Grounding metadata for the generated content.
model_turn | Output only. The content that the model has generated as part of the current conversation with the user.
BidiGenerateContentSetup
Message to be sent in the first (and only the first) client message. Contains configuration that will apply for the duration of the streaming session.
Clients should wait for a BidiGenerateContentSetupComplete message before sending any additional messages.
Field | Description
---|---
model | Required. The model's resource name. This serves as an ID for the Model to use. Format: models/{model}
generation_config | Optional. Generation config. Not all GenerationConfig fields are supported.
system_instruction | Optional. The user provided system instructions for the model. Note: Only text should be used in parts. Content in each part will be in a separate paragraph.
tools[] | Optional. A list of Tools the model may use to generate the next response. A Tool is a piece of code that enables the system to interact with external systems to perform an action, or set of actions, outside of the knowledge and scope of the model.
BidiGenerateContentSetupComplete
This type has no fields.
Sent in response to a BidiGenerateContentSetup message from the client.
BidiGenerateContentToolCall
Request for the client to execute the function calls and return the responses with the matching ids.
Field | Description
---|---
function_calls[] | Output only. The function calls to be executed.
BidiGenerateContentToolCallCancellation
Notification for the client that a previously issued ToolCallMessage with the specified ids should not have been executed and should be cancelled. If there were side-effects to those tool calls, clients may attempt to undo the tool calls. This message occurs only in cases where the clients interrupt server turns.
Field | Description
---|---
ids[] | Output only. The ids of the tool calls to be cancelled.
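For example, a client that tracks side effects per call id could roll them back when a cancellation arrives (a sketch; undo_tool_call is a hypothetical application-side function, and the attribute name mirrors the SDK's snake_case convention):

async for response in session.receive():
    if response.tool_call_cancellation is not None:
        for call_id in response.tool_call_cancellation.ids:
            undo_tool_call(call_id)  # hypothetical rollback of that call's side effects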
BidiGenerateContentToolResponse
Client generated response to a ToolCall received from the server. Individual FunctionResponse objects are matched to the respective FunctionCall objects by the id field.

Note that in the unary and server-streaming GenerateContent APIs, function calling happens by exchanging the Content parts, while in the bidi GenerateContent APIs, function calling happens over this dedicated set of messages.
Field | Description
---|---
function_responses[] | Optional. The response to the function calls.
More information on common types
For more information on the commonly-used API resource types Blob, Content, FunctionCall, FunctionResponse, GenerationConfig, GroundingMetadata, and Tool, see Generating content.
Third-party integrations
For web and mobile app deployments, you can explore options from: