Speech generation (text-to-speech)

The Gemini API can convert text into single-speaker or multi-speaker audio using its native text-to-speech (TTS) capabilities. TTS generation is controllable: you can use natural language to structure interactions and to guide the style, accent, pace, and tone of the audio.

The TTS capability differs from the speech generation provided through the Live API, which is designed for interactive, unstructured audio and for multimodal inputs and outputs. While the Live API excels in dynamic conversational contexts, TTS through the Gemini API is tailored for scenarios that require exact text recitation with fine-grained control over style and sound, such as generating podcasts or audiobooks.

This guide shows you how to generate single-speaker and multi-speaker audio from text.

Before you begin

Make sure you use a Gemini 2.5 model variant with native text-to-speech (TTS) capabilities, as listed in the Supported models section. For best results, choose the model that best fits your specific use case.

You may find it useful to test the Gemini 2.5 TTS models in AI Studio before you start building.

Single-speaker text-to-speech

To convert text to single-speaker audio, set the response modality to "audio" and pass a SpeechConfig object with VoiceConfig set. You need to choose a voice name from the prebuilt voices.

This example saves the audio output from the model to a wave file:

Python

from google import genai
from google.genai import types
import wave

# Set up the wave file to save the output:
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
   with wave.open(filename, "wb") as wf:
      wf.setnchannels(channels)
      wf.setsampwidth(sample_width)
      wf.setframerate(rate)
      wf.writeframes(pcm)

client = genai.Client()

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents="Say cheerfully: Have a wonderful day!",
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
               voice_name='Kore',
            )
         )
      ),
   )
)

data = response.candidates[0].content.parts[0].inline_data.data

file_name='out.wav'
wave_file(file_name, data) # Saves the file to current directory

JavaScript

import {GoogleGenAI} from '@google/genai';
import wav from 'wav';

async function saveWaveFile(
   filename,
   pcmData,
   channels = 1,
   rate = 24000,
   sampleWidth = 2,
) {
   return new Promise((resolve, reject) => {
      const writer = new wav.FileWriter(filename, {
            channels,
            sampleRate: rate,
            bitDepth: sampleWidth * 8,
      });

      writer.on('finish', resolve);
      writer.on('error', reject);

      writer.write(pcmData);
      writer.end();
   });
}

async function main() {
   const ai = new GoogleGenAI({});

   const response = await ai.models.generateContent({
      model: "gemini-2.5-flash-preview-tts",
      contents: [{ parts: [{ text: 'Say cheerfully: Have a wonderful day!' }] }],
      config: {
            responseModalities: ['AUDIO'],
            speechConfig: {
               voiceConfig: {
                  prebuiltVoiceConfig: { voiceName: 'Kore' },
               },
            },
      },
   });

   const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
   const audioBuffer = Buffer.from(data, 'base64');

   const fileName = 'out.wav';
   await saveWaveFile(fileName, audioBuffer);
}
await main();

REST

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "contents": [{
          "parts":[{
            "text": "Say cheerfully: Have a wonderful day!"
          }]
        }],
        "generationConfig": {
          "responseModalities": ["AUDIO"],
          "speechConfig": {
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Kore"
              }
            }
          }
        },
        "model": "gemini-2.5-flash-preview-tts"
    }' | jq -r '.candidates[0].content.parts[0].inlineData.data' | \
          base64 --decode >out.pcm
# You may need to install ffmpeg.
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
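
The samples above treat the returned audio as raw 16-bit PCM at 24 kHz, mono (the defaults of wave_file and the ffmpeg flags). As a minimal sketch under that assumption, the helpers below estimate the length of a clip from its byte count and wrap PCM data in a WAV container in memory. The names pcm_duration_seconds and pcm_to_wav_bytes are hypothetical and not part of the SDK, and the code makes no API call.

```python
import io
import wave


def pcm_duration_seconds(pcm: bytes, channels: int = 1, rate: int = 24000,
                         sample_width: int = 2) -> float:
    """Return the playback length of raw PCM audio in seconds."""
    bytes_per_second = channels * rate * sample_width
    return len(pcm) / bytes_per_second


def pcm_to_wav_bytes(pcm: bytes, channels: int = 1, rate: int = 24000,
                     sample_width: int = 2) -> bytes:
    """Wrap raw PCM in a WAV container without touching the disk."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buffer.getvalue()


# One second of 16-bit mono silence at 24 kHz, standing in for model output.
silence = b"\x00\x00" * 24000
print(pcm_duration_seconds(silence))  # 1.0
```

If the model output ever uses a different sample rate or channel count, pass those values explicitly rather than relying on the defaults.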

Multi-speaker text-to-speech

For multi-speaker audio, you need a MultiSpeakerVoiceConfig object with each speaker (up to 2) configured as a SpeakerVoiceConfig. Define each speaker with the same name used in the prompt:

Python

from google import genai
from google.genai import types
import wave

# Set up the wave file to save the output:
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
   with wave.open(filename, "wb") as wf:
      wf.setnchannels(channels)
      wf.setsampwidth(sample_width)
      wf.setframerate(rate)
      wf.writeframes(pcm)

client = genai.Client()

prompt = """TTS the following conversation between Joe and Jane:
         Joe: How's it going today Jane?
         Jane: Not too bad, how about you?"""

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=prompt,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
               types.SpeakerVoiceConfig(
                  speaker='Joe',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Kore',
                     )
                  )
               ),
               types.SpeakerVoiceConfig(
                  speaker='Jane',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Puck',
                     )
                  )
               ),
            ]
         )
      )
   )
)

data = response.candidates[0].content.parts[0].inline_data.data

file_name='out.wav'
wave_file(file_name, data) # Saves the file to current directory

JavaScript

import {GoogleGenAI} from '@google/genai';
import wav from 'wav';

async function saveWaveFile(
   filename,
   pcmData,
   channels = 1,
   rate = 24000,
   sampleWidth = 2,
) {
   return new Promise((resolve, reject) => {
      const writer = new wav.FileWriter(filename, {
            channels,
            sampleRate: rate,
            bitDepth: sampleWidth * 8,
      });

      writer.on('finish', resolve);
      writer.on('error', reject);

      writer.write(pcmData);
      writer.end();
   });
}

async function main() {
   const ai = new GoogleGenAI({});

   const prompt = `TTS the following conversation between Joe and Jane:
         Joe: How's it going today Jane?
         Jane: Not too bad, how about you?`;

   const response = await ai.models.generateContent({
      model: "gemini-2.5-flash-preview-tts",
      contents: [{ parts: [{ text: prompt }] }],
      config: {
            responseModalities: ['AUDIO'],
            speechConfig: {
               multiSpeakerVoiceConfig: {
                  speakerVoiceConfigs: [
                        {
                           speaker: 'Joe',
                           voiceConfig: {
                              prebuiltVoiceConfig: { voiceName: 'Kore' }
                           }
                        },
                        {
                           speaker: 'Jane',
                           voiceConfig: {
                              prebuiltVoiceConfig: { voiceName: 'Puck' }
                           }
                        }
                  ]
               }
            }
      }
   });

   const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
   const audioBuffer = Buffer.from(data, 'base64');

   const fileName = 'out.wav';
   await saveWaveFile(fileName, audioBuffer);
}

await main();

REST

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
  "contents": [{
    "parts":[{
      "text": "TTS the following conversation between Joe and Jane:\nJoe: Hows it going today Jane?\nJane: Not too bad, how about you?"
    }]
  }],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "multiSpeakerVoiceConfig": {
        "speakerVoiceConfigs": [{
            "speaker": "Joe",
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Kore"
              }
            }
          }, {
            "speaker": "Jane",
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Puck"
              }
            }
          }]
      }
    }
  },
  "model": "gemini-2.5-flash-preview-tts"
}' | jq -r '.candidates[0].content.parts[0].inlineData.data' | \
    base64 --decode > out.pcm
# You may need to install ffmpeg.
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
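
Because each configured speaker must use the same name that appears in the prompt, it can help to check this before sending a request. The sketch below is one possible pre-flight check. The helper names are hypothetical, and it assumes that each turn in the prompt starts with "Name:" as in the samples above. It makes no API call.

```python
import re

MAX_SPEAKERS = 2  # per-request speaker limit stated in this guide


def missing_speakers(prompt: str, speakers: list[str]) -> list[str]:
    """Return configured speaker names that never open a line in the prompt."""
    missing = []
    for name in speakers:
        # Assumption: a turn looks like "Name: text" at the start of a line.
        pattern = rf"^\s*{re.escape(name)}\s*:"
        if not re.search(pattern, prompt, flags=re.MULTILINE):
            missing.append(name)
    return missing


def check_speakers(prompt: str, speakers: list[str]) -> None:
    """Raise ValueError if the speaker list does not fit the prompt."""
    if not 1 <= len(speakers) <= MAX_SPEAKERS:
        raise ValueError(f"expected 1 to {MAX_SPEAKERS} speakers, got {len(speakers)}")
    absent = missing_speakers(prompt, speakers)
    if absent:
        raise ValueError(f"speakers with no line in the prompt: {absent}")


prompt = """TTS the following conversation between Joe and Jane:
         Joe: How's it going today Jane?
         Jane: Not too bad, how about you?"""

check_speakers(prompt, ["Joe", "Jane"])            # passes silently
print(missing_speakers(prompt, ["Joe", "Janet"]))  # ['Janet']
```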

Controlling speech style with prompts

You can control style, tone, accent, and pace using natural language prompts for both single-speaker and multi-speaker TTS. For example, in a single-speaker prompt, you can say:

Say in a spooky whisper:
"By the pricking of my thumbs...
Something wicked this way comes"

In a multi-speaker prompt, give the model each speaker's name and the corresponding transcript. You can also provide guidance for each speaker individually:

Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy:

Speaker1: So... what's on the agenda today?
Speaker2: You're never going to guess!

Try using a voice option that matches the style or emotion you want to convey, to emphasize it even more. In the previous prompt, for example, Enceladus's breathy voice might emphasize "tired" and "bored", while Puck's upbeat tone could complement "excited" and "happy".
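
A prompt of this shape can be assembled from a style direction and a list of turns. The sketch below is only an illustration. build_dialogue_prompt is a hypothetical helper, and the layout (direction, blank line, one "Name: text" line per turn) simply mirrors the example prompt shown in this section.

```python
def build_dialogue_prompt(direction: str, turns: list[tuple[str, str]]) -> str:
    """Combine a style direction with 'Speaker: text' lines into one prompt."""
    # Normalize the direction so it ends with exactly one colon.
    header = direction.strip().rstrip(":") + ":"
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    return header + "\n\n" + "\n".join(lines)


prompt = build_dialogue_prompt(
    "Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy",
    [
        ("Speaker1", "So... what's on the agenda today?"),
        ("Speaker2", "You're never going to guess!"),
    ],
)
print(prompt)
```

The resulting string can be passed as contents in a multi-speaker request, with Speaker1 and Speaker2 used as the speaker names in the voice configuration.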

Generating a prompt to convert to audio

The TTS models only output audio, but you can use other models to generate a transcript first, then pass that transcript to the TTS model to read aloud.

Python

from google import genai
from google.genai import types

client = genai.Client()

transcript = client.models.generate_content(
   model="gemini-2.0-flash",
   contents="""Generate a short transcript around 100 words that reads
            like it was clipped from a podcast by excited herpetologists.
            The hosts names are Dr. Anya and Liam.""").text

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=transcript,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
               types.SpeakerVoiceConfig(
                  speaker='Dr. Anya',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Kore',
                     )
                  )
               ),
               types.SpeakerVoiceConfig(
                  speaker='Liam',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Puck',
                     )
                  )
               ),
            ]
         )
      )
   )
)

# ...Code to stream or save the output

JavaScript

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({});

async function main() {

const transcript = await ai.models.generateContent({
   model: "gemini-2.0-flash",
   contents: "Generate a short transcript around 100 words that reads like it was clipped from a podcast by excited herpetologists. The hosts names are Dr. Anya and Liam.",
   })

const response = await ai.models.generateContent({
   model: "gemini-2.5-flash-preview-tts",
   contents: transcript.text,
   config: {
      responseModalities: ['AUDIO'],
      speechConfig: {
         multiSpeakerVoiceConfig: {
            speakerVoiceConfigs: [
                   {
                     speaker: "Dr. Anya",
                     voiceConfig: {
                        prebuiltVoiceConfig: {voiceName: "Kore"},
                     }
                  },
                  {
                     speaker: "Liam",
                     voiceConfig: {
                        prebuiltVoiceConfig: {voiceName: "Puck"},
                    }
                  }
                ]
              }
            }
      }
  });
   // ...Code to save the output audio, for example as a .wav file
}

await main();

Voice options

TTS models support the following 30 voice options in the voice_name field:

Zephyr -- Bright	Puck -- Upbeat	Charon -- Informative
Kore -- Firm	Fenrir -- Excitable	Leda -- Youthful
Orus -- Firm	Aoede -- Breezy	Callirrhoe -- Easy-going
Autonoe -- Bright	Enceladus -- Breathy	Iapetus -- Clear
Umbriel -- Easy-going	Algieba -- Smooth	Despina -- Smooth
Erinome -- Clear	Algenib -- Gravelly	Rasalgethi -- Informative
Laomedeia -- Upbeat	Achernar -- Soft	Alnilam -- Firm
Schedar -- Even	Gacrux -- Mature	Pulcherrima -- Forward
Achird -- Friendly	Zubenelgenubi -- Casual	Vindemiatrix -- Gentle
Sadachbia -- Lively	Sadaltager -- Knowledgeable	Sulafat -- Warm

You can listen to all the voice options in AI Studio.
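
To pick a voice by its described character, a small lookup can be enough. The sketch below copies only a subset of the voice and style pairs from the table above into a dictionary. VOICE_STYLES and voices_for_style are hypothetical names, and the mapping is not something the API itself exposes.

```python
# Subset of the voice/style pairs listed in the table above.
VOICE_STYLES = {
    "Zephyr": "Bright",
    "Kore": "Firm",
    "Aoede": "Breezy",
    "Autonoe": "Bright",
    "Enceladus": "Breathy",
    "Iapetus": "Clear",
    "Despina": "Smooth",
    "Algenib": "Gravelly",
    "Achernar": "Soft",
    "Schedar": "Even",
}


def voices_for_style(style: str) -> list[str]:
    """Return voice names whose listed style matches, ignoring case."""
    wanted = style.strip().lower()
    return sorted(name for name, label in VOICE_STYLES.items()
                  if label.lower() == wanted)


print(voices_for_style("bright"))  # ['Autonoe', 'Zephyr']
```

The style labels are short descriptions, so listening to the candidates in AI Studio is still the more reliable way to choose.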

Supported languages

The TTS models detect the input language automatically. They support the following 24 languages:

Language	BCP-47 code	Language	BCP-47 code
Arabic (Egypt)	`ar-EG`	German (Germany)	`de-DE`
English (US)	`en-US`	Spanish (US)	`es-US`
French (France)	`fr-FR`	Hindi (India)	`hi-IN`
Indonesian (Indonesia)	`id-ID`	Italian (Italy)	`it-IT`
Japanese (Japan)	`ja-JP`	Korean (Korea)	`ko-KR`
Portuguese (Brazil)	`pt-BR`	Russian (Russia)	`ru-RU`
Dutch (Netherlands)	`nl-NL`	Polish (Poland)	`pl-PL`
Thai (Thailand)	`th-TH`	Turkish (Turkey)	`tr-TR`
Vietnamese (Vietnam)	`vi-VN`	Romanian (Romania)	`ro-RO`
Ukrainian (Ukraine)	`uk-UA`	Bengali (Bangladesh)	`bn-BD`
English (India)	`en-IN` and `hi-IN` bundle	Marathi (India)	`mr-IN`
Tamil (India)	`ta-IN`	Telugu (India)	`te-IN`

Supported models

Model	Single speaker	Multi-speaker
Gemini 2.5 Flash Preview TTS	✔️	✔️
Gemini 2.5 Pro Preview TTS	✔️	✔️

Limitations

TTS models can only receive text inputs and generate audio outputs.
A TTS session has a context window limit of 32k tokens.
See the Supported languages section for the list of supported languages.
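
For long scripts, one way to stay within the context window is to split the text into several requests. The sketch below splits on paragraph boundaries against a token budget. The names split_for_tts and rough_token_count are hypothetical, and the estimate of about 4 characters per token is an assumption, not an official figure, so pass your own counter if you have a more accurate one.

```python
from typing import Callable

CONTEXT_LIMIT_TOKENS = 32_000  # context window stated in this guide


def rough_token_count(text: str) -> int:
    """Crude estimate; ASSUMES about 4 characters per token."""
    return max(1, len(text) // 4)


def split_for_tts(text: str, budget: int = CONTEXT_LIMIT_TOKENS,
                  count_tokens: Callable[[str], int] = rough_token_count) -> list[str]:
    """Group paragraphs into chunks whose estimated size fits the budget."""
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = "\n\n".join(current + [paragraph])
        if current and count_tokens(candidate) > budget:
            # Close the current chunk and start a new one with this paragraph.
            chunks.append("\n\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    # Note: a single paragraph larger than the budget is returned as-is.
    return chunks


script = "\n\n".join(["a" * 40] * 3)  # three paragraphs, about 10 tokens each
print(len(split_for_tts(script, budget=25)))  # 2
```

In practice you would likely use a budget well below the limit to leave headroom, and send each chunk as its own request with the same voice configuration.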

What's next

Try the audio generation cookbook.
Gemini's Live API offers interactive audio generation options that you can interleave with other modalities.
To learn how to work with audio inputs, see the Audio understanding guide.