The Gemini API can transform text input into single-speaker or multi-speaker audio using native text-to-speech (TTS) generation. Text-to-speech (TTS) generation is controllable, meaning you can use natural language to structure interactions and guide the style, accent, pace, and tone of the audio.
The TTS capability differs from speech generation provided through the Live API, which is designed for interactive, unstructured audio and multimodal inputs and outputs. While the Live API excels in dynamic conversational contexts, TTS through the Gemini API is tailored for scenarios that require exact text recitation with fine-grained control over style and sound, such as podcast or audiobook generation.
This guide shows you how to generate single-speaker and multi-speaker audio from text.
Before you begin
Ensure you use a Gemini 2.5 model variant with native text-to-speech (TTS) capabilities, as listed in the Supported models section. For optimal results, consider which model best fits your specific use case.
You may find it useful to test the Gemini 2.5 TTS models in AI Studio before you start building.
Single-speaker text-to-speech
To convert text to single-speaker audio, set the response modality to "audio", and pass a SpeechConfig object with VoiceConfig set. You'll need to choose a voice name from the prebuilt output voices.
This example saves the output audio from the model in a wave file:
Python
from google import genai
from google.genai import types
import wave

# Set up the wave file to save the output:
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Say cheerfully: Have a wonderful day!",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name='Kore',
                )
            )
        ),
    )
)

data = response.candidates[0].content.parts[0].inline_data.data

file_name='out.wav'
wave_file(file_name, data) # Saves the file to current directory
JavaScript
import {GoogleGenAI} from '@google/genai';
import wav from 'wav';
async function saveWaveFile(
  filename,
  pcmData,
  channels = 1,
  rate = 24000,
  sampleWidth = 2,
) {
  return new Promise((resolve, reject) => {
    const writer = new wav.FileWriter(filename, {
      channels,
      sampleRate: rate,
      bitDepth: sampleWidth * 8,
    });

    writer.on('finish', resolve);
    writer.on('error', reject);

    writer.write(pcmData);
    writer.end();
  });
}

async function main() {
  const ai = new GoogleGenAI({});

  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash-preview-tts",
    contents: [{ parts: [{ text: 'Say cheerfully: Have a wonderful day!' }] }],
    config: {
      responseModalities: ['AUDIO'],
      speechConfig: {
        voiceConfig: {
          prebuiltVoiceConfig: { voiceName: 'Kore' },
        },
      },
    },
  });

  const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
  const audioBuffer = Buffer.from(data, 'base64');

  const fileName = 'out.wav';
  await saveWaveFile(fileName, audioBuffer);
}
await main();
REST
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-X POST \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"parts":[{
"text": "Say cheerfully: Have a wonderful day!"
}]
}],
"generationConfig": {
"responseModalities": ["AUDIO"],
"speechConfig": {
"voiceConfig": {
"prebuiltVoiceConfig": {
"voiceName": "Kore"
}
}
}
},
"model": "gemini-2.5-flash-preview-tts",
}' | jq -r '.candidates[0].content.parts[0].inlineData.data' | \
base64 --decode >out.pcm
# You may need to install ffmpeg.
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
Multi-speaker text-to-speech
For multi-speaker audio, you'll need a MultiSpeakerVoiceConfig object with each speaker (up to 2) configured as a SpeakerVoiceConfig. You'll need to define each speaker with the same names used in the prompt:
Python
from google import genai
from google.genai import types
import wave

# Set up the wave file to save the output:
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)

client = genai.Client()

prompt = """TTS the following conversation between Joe and Jane:
Joe: How's it going today Jane?
Jane: Not too bad, how about you?"""

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker='Joe',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name='Kore',
                            )
                        )
                    ),
                    types.SpeakerVoiceConfig(
                        speaker='Jane',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name='Puck',
                            )
                        )
                    ),
                ]
            )
        )
    )
)

data = response.candidates[0].content.parts[0].inline_data.data

file_name='out.wav'
wave_file(file_name, data) # Saves the file to current directory
JavaScript
import {GoogleGenAI} from '@google/genai';
import wav from 'wav';
async function saveWaveFile(
  filename,
  pcmData,
  channels = 1,
  rate = 24000,
  sampleWidth = 2,
) {
  return new Promise((resolve, reject) => {
    const writer = new wav.FileWriter(filename, {
      channels,
      sampleRate: rate,
      bitDepth: sampleWidth * 8,
    });

    writer.on('finish', resolve);
    writer.on('error', reject);

    writer.write(pcmData);
    writer.end();
  });
}

async function main() {
  const ai = new GoogleGenAI({});

  const prompt = `TTS the following conversation between Joe and Jane:
Joe: How's it going today Jane?
Jane: Not too bad, how about you?`;

  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash-preview-tts",
    contents: [{ parts: [{ text: prompt }] }],
    config: {
      responseModalities: ['AUDIO'],
      speechConfig: {
        multiSpeakerVoiceConfig: {
          speakerVoiceConfigs: [
            {
              speaker: 'Joe',
              voiceConfig: {
                prebuiltVoiceConfig: { voiceName: 'Kore' }
              }
            },
            {
              speaker: 'Jane',
              voiceConfig: {
                prebuiltVoiceConfig: { voiceName: 'Puck' }
              }
            }
          ]
        }
      }
    }
  });

  const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
  const audioBuffer = Buffer.from(data, 'base64');

  const fileName = 'out.wav';
  await saveWaveFile(fileName, audioBuffer);
}
await main();
REST
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-X POST \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"parts":[{
"text": "TTS the following conversation between Joe and Jane:
Joe: Hows it going today Jane?
Jane: Not too bad, how about you?"
}]
}],
"generationConfig": {
"responseModalities": ["AUDIO"],
"speechConfig": {
"multiSpeakerVoiceConfig": {
"speakerVoiceConfigs": [{
"speaker": "Joe",
"voiceConfig": {
"prebuiltVoiceConfig": {
"voiceName": "Kore"
}
}
}, {
"speaker": "Jane",
"voiceConfig": {
"prebuiltVoiceConfig": {
"voiceName": "Puck"
}
}
}]
}
}
},
"model": "gemini-2.5-flash-preview-tts",
}' | jq -r '.candidates[0].content.parts[0].inlineData.data' | \
base64 --decode > out.pcm
# You may need to install ffmpeg.
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
Controlling speech style with prompts
You can control style, tone, accent, and pace using natural language prompts for both single- and multi-speaker TTS. For example, in a single-speaker prompt, you can say:
Say in a spooky whisper:
"By the pricking of my thumbs...
Something wicked this way comes"
In a multi-speaker prompt, provide the model with each speaker's name and corresponding transcript. You can also provide guidance for each speaker individually:
Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy:
Speaker1: So... what's on the agenda today?
Speaker2: You're never going to guess!
Try using a voice option that corresponds to the style or emotion you want to convey, to emphasize it even more. In the previous prompt, for example, Enceladus's breathiness might emphasize "tired" and "bored", while Puck's upbeat tone could complement "excited" and "happy", as in the sketch below.
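The following is a minimal Python sketch that pairs the multi-speaker style prompt above with voices whose default character reinforces each direction. It reuses the multi-speaker pattern and wave_file helper from the earlier examples; the speaker names, voice pairing, and output filename are illustrative rather than required.

import wave

from google import genai
from google.genai import types

# Same helper as in the earlier examples: wraps raw 24 kHz, 16-bit PCM in a WAV container.
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)

client = genai.Client()

prompt = """Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy:
Speaker1: So... what's on the agenda today?
Speaker2: You're never going to guess!"""

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    # A breathy voice to reinforce "tired and bored"
                    types.SpeakerVoiceConfig(
                        speaker='Speaker1',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Enceladus')
                        )
                    ),
                    # An upbeat voice to reinforce "excited and happy"
                    types.SpeakerVoiceConfig(
                        speaker='Speaker2',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Puck')
                        )
                    ),
                ]
            )
        )
    )
)

data = response.candidates[0].content.parts[0].inline_data.data
wave_file('styled_dialogue.wav', data)  # illustrative output filename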
Generating a prompt to convert to speech
The TTS models only output audio, but you can use other models to generate a transcript first, then pass that transcript to the TTS model to read aloud.
Python
from google import genai
from google.genai import types

client = genai.Client()

transcript = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="""Generate a short transcript around 100 words that reads
    like it was clipped from a podcast by excited herpetologists.
    The hosts names are Dr. Anya and Liam.""").text

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=transcript,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker='Dr. Anya',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name='Kore',
                            )
                        )
                    ),
                    types.SpeakerVoiceConfig(
                        speaker='Liam',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name='Puck',
                            )
                        )
                    ),
                ]
            )
        )
    )
)

# ...Code to stream or save the output
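# A sketch of the elided save step, assuming the wave_file helper and the
# 24 kHz, 16-bit mono PCM output format from the single-speaker example above:
data = response.candidates[0].content.parts[0].inline_data.data
wave_file('podcast_clip.wav', data)  # 'podcast_clip.wav' is an illustrative filename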
JavaScript
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({});
async function main() {
  // Extract the generated transcript text before passing it to the TTS model.
  const transcript = (await ai.models.generateContent({
    model: "gemini-2.0-flash",
    contents: "Generate a short transcript around 100 words that reads like it was clipped from a podcast by excited herpetologists. The hosts names are Dr. Anya and Liam.",
  })).text;

  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash-preview-tts",
    contents: transcript,
    config: {
      responseModalities: ['AUDIO'],
      speechConfig: {
        multiSpeakerVoiceConfig: {
          speakerVoiceConfigs: [
            {
              speaker: "Dr. Anya",
              voiceConfig: {
                prebuiltVoiceConfig: {voiceName: "Kore"},
              }
            },
            {
              speaker: "Liam",
              voiceConfig: {
                prebuiltVoiceConfig: {voiceName: "Puck"},
              }
            }
          ]
        }
      }
    }
  });
}
// ..JavaScript code for exporting .wav file for output audio
await main();
Voice options
TTS models support the following 30 voice options in the voice_name field:
| Zephyr -- Bright | Puck -- Upbeat | Charon -- Informative |
| Kore -- Firm | Fenrir -- Excitable | Leda -- Youthful |
| Orus -- Firm | Aoede -- Breezy | Callirrhoe -- Easy-going |
| Autonoe -- Bright | Enceladus -- Breathy | Iapetus -- Clear |
| Umbriel -- Easy-going | Algieba -- Smooth | Despina -- Smooth |
| Erinome -- Clear | Algenib -- Gravelly | Rasalgethi -- Informative |
| Laomedeia -- Upbeat | Achernar -- Soft | Alnilam -- Firm |
| Schedar -- Even | Gacrux -- Mature | Pulcherrima -- Forward |
| Achird -- Friendly | Zubenelgenubi -- Casual | Vindemiatrix -- Gentle |
| Sadachbia -- Lively | Sadaltager -- Knowledgeable | Sulafat -- Warm |
You can hear all the voice options in AI Studio.
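To compare voices programmatically, you can also generate the same line with several voice_name values and listen to the results. The following is a minimal Python sketch reusing the single-speaker pattern and wave_file helper from the example above; the subset of voices and the output filenames are illustrative.

import wave

from google import genai
from google.genai import types

# Same helper as in the earlier examples.
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)

client = genai.Client()

# Illustrative subset of the 30 prebuilt voices listed above.
for voice in ['Kore', 'Puck', 'Enceladus']:
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",
        contents="Say cheerfully: Have a wonderful day!",
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
                )
            ),
        )
    )
    data = response.candidates[0].content.parts[0].inline_data.data
    wave_file(f"{voice}.wav", data)  # e.g. Kore.wav, Puck.wav, Enceladus.wav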
Supported languages
The TTS models detect the input language automatically. They support the following 24 languages:
| Language | BCP-47 Code | Language | BCP-47 Code |
|---|---|---|---|
| Arabic (Egyptian) | ar-EG | German (Germany) | de-DE |
| English (US) | en-US | Spanish (US) | es-US |
| French (France) | fr-FR | Hindi (India) | hi-IN |
| Indonesian (Indonesia) | id-ID | Italian (Italy) | it-IT |
| Japanese (Japan) | ja-JP | Korean (Korea) | ko-KR |
| Portuguese (Brazil) | pt-BR | Russian (Russia) | ru-RU |
| Dutch (Netherlands) | nl-NL | Polish (Poland) | pl-PL |
| Thai (Thailand) | th-TH | Turkish (Turkey) | tr-TR |
| Vietnamese (Vietnam) | vi-VN | Romanian (Romania) | ro-RO |
| Ukrainian (Ukraine) | uk-UA | Bengali (Bangladesh) | bn-BD |
| English (India) | en-IN & hi-IN bundle | Marathi (India) | mr-IN |
| Tamil (India) | ta-IN | Telugu (India) | te-IN |
Supported models
| Model | Single speaker | Multispeaker |
|---|---|---|
| Gemini 2.5 Flash Preview TTS | ✔️ | ✔️ |
| Gemini 2.5 Pro Preview TTS | ✔️ | ✔️ |
Limitations
- TTS models can only receive text inputs and generate audio outputs.
- A TTS session has a context window limit of 32k tokens.
- Review the Languages section for language support.
Prompting guide
Unlike traditional TTS models, the Gemini native audio generation text-to-speech (TTS) models use a large language model that understands not just what to say, but how to say it.
To unlock this capability, think of yourself as a director setting the scene for a virtual voice actor. When crafting a prompt, consider these components: a voice profile (defining the character's core traits and archetype), a scene description (establishing the physical environment and emotional "vibe"), and director's notes (providing more precise performance direction, including control over style, accent, and pacing).
By providing nuanced instructions, such as a precise regional accent, specific paralinguistic features (like breathiness), or a pace of speech, you can leverage the model's contextual awareness to produce highly dynamic, natural, and expressive audio. For best results, keep the transcript and the director's prompt consistent, so that who says what matches what is said and how it's said.
This guide is intended to provide baseline guidance and inspiration as you develop audio experiences with Gemini TTS audio generation. We're excited to see what you build!
Prompt structure
An ideal prompt includes the following elements, which work together for a great performance:
- Voice profile: Establishes a persona for the voice, defining who the character is, their archetype, and any other traits such as age, background, and so on.
- Scene: Sets the stage. Describes the physical environment and the "vibe".
- Director's notes: Spells out key instructions for the virtual performer, such as style, breathing, pacing, diction, and accent.
- Sample context: Gives the model a contextual starting point so the virtual actor enters the scene you've set naturally.
- Transcript: The text the model will read aloud. For best performance, keep the transcript's subject matter and writing style relevant to the instructions you provide.
Full example prompt:
# AUDIO PROFILE: Jaz R.
## "The Morning Hype"
## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline,
but inside, it is blindingly bright. The red "ON AIR" tally light is blazing.
Jaz is standing up, not sitting, bouncing on the balls of their heels to the
rhythm of a thumping backing track. Their hands fly across the faders on a
massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake
up an entire nation.
### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is
always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated
vowels on excitement words (e.g., "Beauuutiful morning").
Pace: Speaks at an energetic pace, keeping up with the fast music. Speaks
with A "bouncing" cadence. High-speed delivery with fluid transitions — no dead
air, no gaps.
Accent: Jaz is from Brixton, London
### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any
script that requires a charismatic Estuary accent and 11/10 infectious energy.
#### TRANSCRIPT
Yes, massive vibes in the studio! You are locked in and it is absolutely
popping off in London right now. If you're stuck on the tube, or just sat
there pretending to work... stop it. Seriously, I see you. Turn this up!
We've got the project roadmap landing in three, two... let's go!
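A profile like this is passed to the TTS model as ordinary prompt text. The following is a minimal Python sketch reusing the single-speaker pattern and the wave_file helper from the earlier examples; the voice choice and output filename are illustrative and not prescribed by the profile.

from google import genai
from google.genai import types

client = genai.Client()

# director_prompt holds the full profile above (scene, director's notes,
# sample context, and transcript) as one string.
director_prompt = """# AUDIO PROFILE: Jaz R.
## "The Morning Hype"
... full profile and transcript from the example above ...
"""

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=director_prompt,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                # An upbeat prebuilt voice is an illustrative fit for this energetic profile.
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Puck')
            )
        ),
    )
)

data = response.candidates[0].content.parts[0].inline_data.data
# Save with the wave_file helper defined in the examples near the top of this page.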
Detailed prompting strategies
Let's walk through each element of the prompt.
Audio profile
A brief description of the character's personality.
- Name: Giving the character a name helps the model anchor the persona and improves performance. Use the character's name when setting the scene and context.
- Role: The character's core identity and archetype within the scene. For example: radio DJ, podcast creator, news reporter, and so on.
Examples:
# AUDIO PROFILE: Jaz R.
## "The Morning Hype"
# AUDIO PROFILE: Monica A.
## "The Beauty Influencer"
The scene
Set the context for the scene, including location, mood, and environmental details, to establish tone and atmosphere. Describe what is happening around the character and how it affects them. The scene provides environmental context for the whole interaction and guides the performance in a subtle, natural way.
Examples:
## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline,
but inside, it is blindingly bright. The red "ON AIR" tally light is blazing.
Jaz is standing up, not sitting, bouncing on the balls of their heels to the
rhythm of a thumping backing track. Their hands fly across the faders on a
massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to
wake up an entire nation.
## THE SCENE: Homegrown Studio
A meticulously sound-treated bedroom in a suburban home. The space is
deadened by plush velvet curtains and a heavy rug, but there is a
distinct "proximity effect."
Director's notes
This key section contains specific performance guidance. You can skip every other element, but we recommend including this one.
Define only what matters most to the performance, and be careful not to over-specify. Overly rigid rules can constrain the model's creativity and lead to a weaker performance. Balance the character and scene descriptions against specific performance rules.
The most common directions are style, pacing, and accent, but the model is neither limited to nor requires them. You can add custom directions as needed to cover any other details that matter to the performance.
For example:
### DIRECTOR'S NOTES
Style: Enthusiastic and Sassy GenZ beauty YouTuber
Pacing: Speaks at an energetic pace, keeping up with the extremely fast, rapid
delivery influencers use in short form videos.
Accent: Southern california valley girl from Laguna Beach
Style:
Sets the tone and style of the generated speech. This can include cheerful, energetic, relaxed, bored, and so on to guide the performance. Be as detailed as you can: "Full of enthusiasm. The listener should feel like part of a big, lively community event." is more effective than simply "energetic and enthusiastic."
You can even try terminology common in the voice-over industry, such as the "vocal smile". You can layer multiple style traits as needed.
Examples:
Simple emotion
DIRECTORS NOTES
...
Style: Frustrated and angry developer who can't get the build to run.
...
More detailed
DIRECTORS NOTES
...
Style: Sassy GenZ beauty YouTuber, who mostly creates content for YouTube Shorts.
...
Complex
DIRECTORS NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is
always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and
elongated vowels on excitement words (e.g., "Beauuutiful morning").
Accent:
Describe the accent you want; the more specific you are, the more closely the result will match your intent. For example, "British accent, like someone from Croydon, England" rather than just "British accent".
Examples:
### DIRECTORS NOTES
...
Accent: Southern california valley girl from Laguna Beach
...
### DIRECTORS NOTES
...
Accent: Jaz is from Brixton, London
...
Pacing:
The overall tempo and rhythm changes across the performance.
Examples:
Simple
### DIRECTORS NOTES
...
Pacing: Speak as fast as possible
...
More detailed
### DIRECTORS NOTES
...
Pacing: Speaks at a faster, energetic pace, keeping up with fast paced music.
...
Complex
### DIRECTORS NOTES
...
Pacing: The "Drift": The tempo is incredibly slow and liquid. Words bleed into each other. There is zero urgency.
...
Try it out
Try these examples in AI Studio, and use the TTS app to let Gemini put you in the director's chair. Here are some tips for directing a great vocal performance:
- Keep the whole prompt consistent: the script and the directions work together to create a great performance.
- You don't need to describe everything in exhaustive detail; sometimes letting the model fill in the blanks produces more natural results (just like a talented actor).
- If you get stuck, ask Gemini to help write the script or the performance directions.