The Gemini API can transform text input into single-speaker or multi-speaker audio using native text-to-speech (TTS) generation. Text-to-speech (TTS) generation is controllable, meaning you can use natural language to structure interactions and guide the style, accent, pace, and tone of the audio.
The TTS capability differs from speech generation provided through the Live API, which is designed for interactive, unstructured audio and multimodal inputs and outputs. While the Live API excels in dynamic conversational contexts, TTS through the Gemini API is tailored to scenarios that require exact text recitation with fine-grained control over style and sound, such as podcast or audiobook generation.
This guide shows you how to generate single-speaker and multi-speaker audio from text.
Before you begin
Ensure you use a Gemini 2.5 model variant with native text-to-speech (TTS) capabilities, as listed in the Supported models section. For optimal results, consider which model best fits your specific use case.
You may find it useful to test the Gemini 2.5 TTS models in AI Studio before you start building.
Single-speaker text-to-speech
To convert text to single-speaker audio, set the response modality to "audio" and pass a SpeechConfig object with VoiceConfig set. Choose a voice name from the prebuilt output voices.
This example saves the output audio from the model in a wave file:
Python
from google import genai
from google.genai import types
import wave
# Set up the wave file to save the output:
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
   with wave.open(filename, "wb") as wf:
      wf.setnchannels(channels)
      wf.setsampwidth(sample_width)
      wf.setframerate(rate)
      wf.writeframes(pcm)
client = genai.Client()
response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents="Say cheerfully: Have a wonderful day!",
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
               voice_name='Kore',
            )
         )
      ),
   )
)
data = response.candidates[0].content.parts[0].inline_data.data
file_name='out.wav'
wave_file(file_name, data) # Saves the file to current directory
JavaScript
import {GoogleGenAI} from '@google/genai';
import wav from 'wav';
async function saveWaveFile(
   filename,
   pcmData,
   channels = 1,
   rate = 24000,
   sampleWidth = 2,
) {
   return new Promise((resolve, reject) => {
      const writer = new wav.FileWriter(filename, {
            channels,
            sampleRate: rate,
            bitDepth: sampleWidth * 8,
      });
      writer.on('finish', resolve);
      writer.on('error', reject);
      writer.write(pcmData);
      writer.end();
   });
}
async function main() {
   const ai = new GoogleGenAI({});
   const response = await ai.models.generateContent({
      model: "gemini-2.5-flash-preview-tts",
      contents: [{ parts: [{ text: 'Say cheerfully: Have a wonderful day!' }] }],
      config: {
            responseModalities: ['AUDIO'],
            speechConfig: {
               voiceConfig: {
                  prebuiltVoiceConfig: { voiceName: 'Kore' },
               },
            },
      },
   });
   const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
   const audioBuffer = Buffer.from(data, 'base64');
   const fileName = 'out.wav';
   await saveWaveFile(fileName, audioBuffer);
}
await main();
REST
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "contents": [{
          "parts":[{
            "text": "Say cheerfully: Have a wonderful day!"
          }]
        }],
        "generationConfig": {
          "responseModalities": ["AUDIO"],
          "speechConfig": {
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Kore"
              }
            }
          }
        },
        "model": "gemini-2.5-flash-preview-tts",
    }' | jq -r '.candidates[0].content.parts[0].inlineData.data' | \
          base64 --decode >out.pcm
# You may need to install ffmpeg.
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
Multi-speaker text-to-speech
For multi-speaker audio, you'll need a MultiSpeakerVoiceConfig object with each speaker (up to 2) configured as a SpeakerVoiceConfig. You'll need to define each speaker with the same names used in the prompt:
Python
from google import genai
from google.genai import types
import wave
# Set up the wave file to save the output:
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
   with wave.open(filename, "wb") as wf:
      wf.setnchannels(channels)
      wf.setsampwidth(sample_width)
      wf.setframerate(rate)
      wf.writeframes(pcm)
client = genai.Client()
prompt = """TTS the following conversation between Joe and Jane:
         Joe: How's it going today Jane?
         Jane: Not too bad, how about you?"""
response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=prompt,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
               types.SpeakerVoiceConfig(
                  speaker='Joe',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Kore',
                     )
                  )
               ),
               types.SpeakerVoiceConfig(
                  speaker='Jane',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Puck',
                     )
                  )
               ),
            ]
         )
      )
   )
)
data = response.candidates[0].content.parts[0].inline_data.data
file_name='out.wav'
wave_file(file_name, data) # Saves the file to current directory
JavaScript
import {GoogleGenAI} from '@google/genai';
import wav from 'wav';
async function saveWaveFile(
   filename,
   pcmData,
   channels = 1,
   rate = 24000,
   sampleWidth = 2,
) {
   return new Promise((resolve, reject) => {
      const writer = new wav.FileWriter(filename, {
            channels,
            sampleRate: rate,
            bitDepth: sampleWidth * 8,
      });
      writer.on('finish', resolve);
      writer.on('error', reject);
      writer.write(pcmData);
      writer.end();
   });
}
async function main() {
   const ai = new GoogleGenAI({});
   const prompt = `TTS the following conversation between Joe and Jane:
         Joe: How's it going today Jane?
         Jane: Not too bad, how about you?`;
   const response = await ai.models.generateContent({
      model: "gemini-2.5-flash-preview-tts",
      contents: [{ parts: [{ text: prompt }] }],
      config: {
            responseModalities: ['AUDIO'],
            speechConfig: {
               multiSpeakerVoiceConfig: {
                  speakerVoiceConfigs: [
                        {
                           speaker: 'Joe',
                           voiceConfig: {
                              prebuiltVoiceConfig: { voiceName: 'Kore' }
                           }
                        },
                        {
                           speaker: 'Jane',
                           voiceConfig: {
                              prebuiltVoiceConfig: { voiceName: 'Puck' }
                           }
                        }
                  ]
               }
            }
      }
   });
   const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
   const audioBuffer = Buffer.from(data, 'base64');
   const fileName = 'out.wav';
   await saveWaveFile(fileName, audioBuffer);
}
await main();
REST
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
  "contents": [{
    "parts":[{
      "text": "TTS the following conversation between Joe and Jane:
                Joe: Hows it going today Jane?
                Jane: Not too bad, how about you?"
    }]
  }],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "multiSpeakerVoiceConfig": {
        "speakerVoiceConfigs": [{
            "speaker": "Joe",
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Kore"
              }
            }
          }, {
            "speaker": "Jane",
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Puck"
              }
            }
          }]
      }
    }
  },
  "model": "gemini-2.5-flash-preview-tts",
}' | jq -r '.candidates[0].content.parts[0].inlineData.data' | \
    base64 --decode > out.pcm
# You may need to install ffmpeg.
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
Controlling speech style with prompts
You can control style, tone, accent, and pace using natural language prompts, for both single-speaker and multi-speaker TTS. For example, in a single-speaker prompt, you can say:
Say in a spooky whisper:
"By the pricking of my thumbs...
Something wicked this way comes"
In a multi-speaker prompt, provide the model with each speaker's name and corresponding transcript. You can also provide guidance for each speaker individually:
Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy:
Speaker1: So... what's on the agenda today?
Speaker2: You're never going to guess!
Try using a voice option that matches the style or emotion you want to convey, to emphasize it even more. In the previous prompt, for example, Enceladus's breathiness might emphasize "tired" and "bored", while Puck's upbeat tone could complement "excited" and "happy".
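As a sketch of how a style directive and a matching voice fit together, the snippet below reuses the single-speaker setup shown earlier with the whispered prompt above and the breathy 'Enceladus' voice; the pairing is only illustrative:
from google import genai
from google.genai import types

client = genai.Client()

# The style instruction lives in the prompt text itself; the breathy
# "Enceladus" voice is chosen here only to match the whispered delivery.
response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=('Say in a spooky whisper: "By the pricking of my thumbs... '
             'Something wicked this way comes"'),
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
               voice_name='Enceladus',
            )
         )
      ),
   )
)
# Save or play the returned audio as in the earlier examples.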
Generating a prompt to convert to speech
The TTS models only output audio, but you can use other models to generate a transcript first, then pass that transcript to the TTS model to read aloud.
Python
from google import genai
from google.genai import types
client = genai.Client()
transcript = client.models.generate_content(
   model="gemini-2.0-flash",
   contents="""Generate a short transcript around 100 words that reads
            like it was clipped from a podcast by excited herpetologists.
            The hosts names are Dr. Anya and Liam.""").text
response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=transcript,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
               types.SpeakerVoiceConfig(
                  speaker='Dr. Anya',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Kore',
                     )
                  )
               ),
               types.SpeakerVoiceConfig(
                  speaker='Liam',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Puck',
                     )
                  )
               ),
            ]
         )
      )
   )
)
# ...Code to stream or save the output
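# For example, the audio can be saved by reusing the wave_file helper
# defined in the earlier examples (one possible approach, not the only one):
data = response.candidates[0].content.parts[0].inline_data.data
wave_file('out.wav', data)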
JavaScript
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({});
async function main() {
const transcript = await ai.models.generateContent({
   model: "gemini-2.0-flash",
   contents: "Generate a short transcript around 100 words that reads like it was clipped from a podcast by excited herpetologists. The hosts names are Dr. Anya and Liam.",
   })
const response = await ai.models.generateContent({
   model: "gemini-2.5-flash-preview-tts",
   contents: transcript.text,
   config: {
      responseModalities: ['AUDIO'],
      speechConfig: {
         multiSpeakerVoiceConfig: {
            speakerVoiceConfigs: [
                   {
                     speaker: "Dr. Anya",
                     voiceConfig: {
                        prebuiltVoiceConfig: {voiceName: "Kore"},
                     }
                  },
                  {
                     speaker: "Liam",
                     voiceConfig: {
                        prebuiltVoiceConfig: {voiceName: "Puck"},
                    }
                  }
                ]
              }
            }
      }
  });
}
// ..JavaScript code for exporting .wav file for output audio
await main();
Voice options
The TTS models support the following 30 voice options in the voice_name field:
| Zephyr -- Bright | Puck -- Upbeat | Charon -- Informative |
| Kore -- Firm | Fenrir -- Excitable | Leda -- Youthful |
| Orus -- Firm | Aoede -- Breezy | Callirrhoe -- Easy-going |
| Autonoe -- Bright | Enceladus -- Breathy | Iapetus -- Clear |
| Umbriel -- Easy-going | Algieba -- Smooth | Despina -- Smooth |
| Erinome -- Clear | Algenib -- Gravelly | Rasalgethi -- Informative |
| Laomedeia -- Upbeat | Achernar -- Soft | Alnilam -- Firm |
| Schedar -- Even | Gacrux -- Mature | Pulcherrima -- Forward |
| Achird -- Friendly | Zubenelgenubi -- Casual | Vindemiatrix -- Gentle |
| Sadachbia -- Lively | Sadaltager -- Knowledgeable | Sulafat -- Warm |
You can hear all the voice options in AI Studio.
Supported languages
The TTS models detect the input language automatically. They support the following 24 languages (a short example follows the table):
| Language | BCP-47 code | Language | BCP-47 code |
|---|---|---|---|
| Arabic (Egyptian) | ar-EG | German (Germany) | de-DE |
| English (US) | en-US | Spanish (US) | es-US |
| French (France) | fr-FR | Hindi (India) | hi-IN |
| Indonesian (Indonesia) | id-ID | Italian (Italy) | it-IT |
| Japanese (Japan) | ja-JP | Korean (Korea) | ko-KR |
| Portuguese (Brazil) | pt-BR | Russian (Russia) | ru-RU |
| Dutch (Netherlands) | nl-NL | Polish (Poland) | pl-PL |
| Thai (Thailand) | th-TH | Turkish (Turkey) | tr-TR |
| Vietnamese (Vietnam) | vi-VN | Romanian (Romania) | ro-RO |
| Ukrainian (Ukraine) | uk-UA | Bengali (Bangladesh) | bn-BD |
| English (India) | en-IN & hi-IN bundle | Marathi (India) | mr-IN |
| Tamil (India) | ta-IN | Telugu (India) | te-IN |
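Because the language is inferred from the input text, the request looks the same as for English; a minimal sketch (the French prompt and the voice choice are just examples):
from google import genai
from google.genai import types

client = genai.Client()

# No language parameter is needed; the model detects the language
# from the input text itself (French in this illustrative prompt).
response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents="Dis joyeusement : Passe une excellente journée !",
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore')
         )
      ),
   )
)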
Supported models
| Model | Single speaker | Multi-speaker |
|---|---|---|
| Gemini 2.5 Flash Preview TTS | ✔️ | ✔️ |
| Gemini 2.5 Pro Preview TTS | ✔️ | ✔️ |
Limitations
- TTS models can only receive text inputs and generate audio outputs.
- A TTS session has a context window limit of 32,000 tokens.
- Review the Supported languages section for language support.