Speech generation (text-to-speech)

The Gemini API can transform text input into single-speaker or multi-speaker audio using native text-to-speech (TTS) generation capabilities. TTS generation is controllable: you can use natural language to structure interactions and guide the style, accent, pace, and tone of the audio.

The TTS capability differs from speech generation provided through the Live API, which is designed for interactive, unstructured audio and for multimodal inputs and outputs. While the Live API excels in dynamic conversational contexts, TTS through the Gemini API is tailored for scenarios that require exact text recitation with fine-grained control over style and sound, such as podcast or audiobook generation.

This guide shows you how to generate single-speaker and multi-speaker audio from text.

Before you begin

Make sure you use a Gemini 2.5 model variant with native text-to-speech (TTS) capabilities, as listed in the Supported models section. For optimal results, consider which model best fits your specific use case.

You may find it useful to test the Gemini 2.5 TTS models in AI Studio before you start building.

Single-speaker text-to-speech

To convert text to single-speaker audio, set the response modality to "audio" and pass a SpeechConfig object with VoiceConfig set. You need to choose a voice name from the prebuilt output voices.

This example saves the output audio from the model in a wave file:

Python

from google import genai
from google.genai import types
import wave

# Set up the wave file to save the output:
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
   with wave.open(filename, "wb") as wf:
      wf.setnchannels(channels)
      wf.setsampwidth(sample_width)
      wf.setframerate(rate)
      wf.writeframes(pcm)

client = genai.Client()

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents="Say cheerfully: Have a wonderful day!",
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
               voice_name='Kore',
            )
         )
      ),
   )
)

data = response.candidates[0].content.parts[0].inline_data.data

file_name='out.wav'
wave_file(file_name, data) # Saves the file to current directory

JavaScript

import {GoogleGenAI} from '@google/genai';
import wav from 'wav';

async function saveWaveFile(
   filename,
   pcmData,
   channels = 1,
   rate = 24000,
   sampleWidth = 2,
) {
   return new Promise((resolve, reject) => {
      const writer = new wav.FileWriter(filename, {
            channels,
            sampleRate: rate,
            bitDepth: sampleWidth * 8,
      });

      writer.on('finish', resolve);
      writer.on('error', reject);

      writer.write(pcmData);
      writer.end();
   });
}

async function main() {
   const ai = new GoogleGenAI({});

   const response = await ai.models.generateContent({
      model: "gemini-2.5-flash-preview-tts",
      contents: [{ parts: [{ text: 'Say cheerfully: Have a wonderful day!' }] }],
      config: {
            responseModalities: ['AUDIO'],
            speechConfig: {
               voiceConfig: {
                  prebuiltVoiceConfig: { voiceName: 'Kore' },
               },
            },
      },
   });

   const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
   const audioBuffer = Buffer.from(data, 'base64');

   const fileName = 'out.wav';
   await saveWaveFile(fileName, audioBuffer);
}
await main();

REST

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "contents": [{
          "parts":[{
            "text": "Say cheerfully: Have a wonderful day!"
          }]
        }],
        "generationConfig": {
          "responseModalities": ["AUDIO"],
          "speechConfig": {
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Kore"
              }
            }
          }
        },
        "model": "gemini-2.5-flash-preview-tts"
    }' | jq -r '.candidates[0].content.parts[0].inlineData.data' | \
          base64 --decode >out.pcm
# You may need to install ffmpeg.
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
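
The ffmpeg flags above describe the raw stream: signed 16-bit little-endian samples (`s16le`), a 24,000 Hz sample rate, and one channel. As a rough sketch, the same wrapping step can be done with Python's standard `wave` module, which also makes it easy to work out how long a clip is. The helper names `pcm_to_wav_bytes` and `pcm_duration_seconds` are illustrative and not part of any SDK.

```python
import io
import wave

SAMPLE_RATE = 24000   # Hz, matches `-ar 24000`
SAMPLE_WIDTH = 2      # bytes per sample, matches `s16le`
CHANNELS = 1          # mono, matches `-ac 1`


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Clip length: total bytes / (bytes per frame * frames per second)."""
    return len(pcm) / (SAMPLE_WIDTH * CHANNELS * SAMPLE_RATE)


if __name__ == "__main__":
    # Synthetic input: one second of silence instead of real model output.
    silence = b"\x00\x00" * SAMPLE_RATE
    print(pcm_duration_seconds(silence), len(pcm_to_wav_bytes(silence)))
```

In practice the `pcm` argument would be the decoded `out.pcm` contents; the silent buffer here only keeps the sketch runnable offline.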

Multi-speaker text-to-speech

For multi-speaker audio, you need a MultiSpeakerVoiceConfig object with each speaker (up to 2) configured as a SpeakerVoiceConfig. Define each speaker with the same name used in the prompt:

Python

from google import genai
from google.genai import types
import wave

# Set up the wave file to save the output:
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
   with wave.open(filename, "wb") as wf:
      wf.setnchannels(channels)
      wf.setsampwidth(sample_width)
      wf.setframerate(rate)
      wf.writeframes(pcm)

client = genai.Client()

prompt = """TTS the following conversation between Joe and Jane:
         Joe: How's it going today Jane?
         Jane: Not too bad, how about you?"""

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=prompt,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
               types.SpeakerVoiceConfig(
                  speaker='Joe',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Kore',
                     )
                  )
               ),
               types.SpeakerVoiceConfig(
                  speaker='Jane',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Puck',
                     )
                  )
               ),
            ]
         )
      )
   )
)

data = response.candidates[0].content.parts[0].inline_data.data

file_name='out.wav'
wave_file(file_name, data) # Saves the file to current directory

JavaScript

import {GoogleGenAI} from '@google/genai';
import wav from 'wav';

async function saveWaveFile(
   filename,
   pcmData,
   channels = 1,
   rate = 24000,
   sampleWidth = 2,
) {
   return new Promise((resolve, reject) => {
      const writer = new wav.FileWriter(filename, {
            channels,
            sampleRate: rate,
            bitDepth: sampleWidth * 8,
      });

      writer.on('finish', resolve);
      writer.on('error', reject);

      writer.write(pcmData);
      writer.end();
   });
}

async function main() {
   const ai = new GoogleGenAI({});

   const prompt = `TTS the following conversation between Joe and Jane:
         Joe: How's it going today Jane?
         Jane: Not too bad, how about you?`;

   const response = await ai.models.generateContent({
      model: "gemini-2.5-flash-preview-tts",
      contents: [{ parts: [{ text: prompt }] }],
      config: {
            responseModalities: ['AUDIO'],
            speechConfig: {
               multiSpeakerVoiceConfig: {
                  speakerVoiceConfigs: [
                        {
                           speaker: 'Joe',
                           voiceConfig: {
                              prebuiltVoiceConfig: { voiceName: 'Kore' }
                           }
                        },
                        {
                           speaker: 'Jane',
                           voiceConfig: {
                              prebuiltVoiceConfig: { voiceName: 'Puck' }
                           }
                        }
                  ]
               }
            }
      }
   });

   const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
   const audioBuffer = Buffer.from(data, 'base64');

   const fileName = 'out.wav';
   await saveWaveFile(fileName, audioBuffer);
}

await main();

REST

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-tts:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
  "contents": [{
    "parts":[{
      "text": "TTS the following conversation between Joe and Jane:\nJoe: How is it going today Jane?\nJane: Not too bad, how about you?"
    }]
  }],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "multiSpeakerVoiceConfig": {
        "speakerVoiceConfigs": [{
            "speaker": "Joe",
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Kore"
              }
            }
          }, {
            "speaker": "Jane",
            "voiceConfig": {
              "prebuiltVoiceConfig": {
                "voiceName": "Puck"
              }
            }
          }]
      }
    }
  },
  "model": "gemini-2.5-flash-preview-tts"
}' | jq -r '.candidates[0].content.parts[0].inlineData.data' | \
    base64 --decode > out.pcm
# You may need to install ffmpeg.
ffmpeg -f s16le -ar 24000 -ac 1 -i out.pcm out.wav
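
Because the speaker names in the configuration must match the names used in the prompt, and at most two speakers are allowed, a small pre-flight check can catch mistakes before a request is sent. The sketch below builds the `speechConfig` part of the REST body as a plain dictionary. `build_multi_speaker_config` is a hypothetical helper, and the rule that each speaker appears as `Name:` at the start of a line is an assumption about how the prompt is written.

```python
import re

MAX_SPEAKERS = 2  # limit stated in this guide


def build_multi_speaker_config(prompt: str, voices: dict[str, str]) -> dict:
    """Return a speechConfig dict for a speaker -> voice name mapping.

    Raises ValueError when the mapping is empty, has too many speakers,
    or names a speaker that has no "<name>:" line in the prompt.
    """
    if not 1 <= len(voices) <= MAX_SPEAKERS:
        raise ValueError(f"expected 1 to {MAX_SPEAKERS} speakers, got {len(voices)}")
    for speaker in voices:
        pattern = rf"^[ \t]*{re.escape(speaker)}[ \t]*:"
        if not re.search(pattern, prompt, flags=re.MULTILINE):
            raise ValueError(f"speaker {speaker!r} has no line in the prompt")
    return {
        "multiSpeakerVoiceConfig": {
            "speakerVoiceConfigs": [
                {
                    "speaker": speaker,
                    "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}},
                }
                for speaker, voice in voices.items()
            ]
        }
    }


if __name__ == "__main__":
    prompt = (
        "TTS the following conversation between Joe and Jane:\n"
        "Joe: How's it going today Jane?\n"
        "Jane: Not too bad, how about you?"
    )
    print(build_multi_speaker_config(prompt, {"Joe": "Kore", "Jane": "Puck"}))
```

The returned dictionary has the same shape as the `speechConfig` object in the REST example, so it could be dropped into `generationConfig` when assembling a request body by hand.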

Controlling speech style with prompts

You can control style, tone, accent, and pace using natural language prompts for both single-speaker and multi-speaker TTS. For example, in a single-speaker prompt, you can say:

Say in a spooky whisper:
"By the pricking of my thumbs...
Something wicked this way comes"

In a multi-speaker prompt, provide the model with each speaker's name and the corresponding transcript. You can also provide guidance for each speaker individually:

Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy:

Speaker1: So... what's on the agenda today?
Speaker2: You're never going to guess!

To emphasize the style or emotion you want to convey, try using a voice option that corresponds to it. In the previous prompt, for example, Enceladus's breathiness might emphasize "tired" and "bored", while Puck's upbeat tone could complement "excited" and "happy".

Generating a prompt to convert to audio

The TTS models only output audio, but you can use other models to generate a transcript first, then pass that transcript to the TTS model to read aloud.

Python

from google import genai
from google.genai import types

client = genai.Client()

transcript = client.models.generate_content(
   model="gemini-2.0-flash",
   contents="""Generate a short transcript around 100 words that reads
            like it was clipped from a podcast by excited herpetologists.
            The hosts names are Dr. Anya and Liam.""").text

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=transcript,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
               types.SpeakerVoiceConfig(
                  speaker='Dr. Anya',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Kore',
                     )
                  )
               ),
               types.SpeakerVoiceConfig(
                  speaker='Liam',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Puck',
                     )
                  )
               ),
            ]
         )
      )
   )
)

# ...Code to stream or save the output

JavaScript

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({});

async function main() {

const transcript = await ai.models.generateContent({
   model: "gemini-2.0-flash",
   contents: "Generate a short transcript around 100 words that reads like it was clipped from a podcast by excited herpetologists. The hosts names are Dr. Anya and Liam.",
   })

const response = await ai.models.generateContent({
   model: "gemini-2.5-flash-preview-tts",
   contents: transcript.text, // pass the generated text, not the response object
   config: {
      responseModalities: ['AUDIO'],
      speechConfig: {
         multiSpeakerVoiceConfig: {
            speakerVoiceConfigs: [
                   {
                     speaker: "Dr. Anya",
                     voiceConfig: {
                        prebuiltVoiceConfig: {voiceName: "Kore"},
                     }
                  },
                  {
                     speaker: "Liam",
                     voiceConfig: {
                        prebuiltVoiceConfig: {voiceName: "Puck"},
                    }
                  }
                ]
              }
            }
      }
  });
}
// ..JavaScript code for exporting .wav file for output audio

await main();
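
A generated transcript may not label speakers exactly as the voice configuration expects (here `Dr. Anya` and `Liam`), so it can help to inspect the labels before calling the TTS model. The following is a minimal sketch that assumes each turn is written as `Label: text` on its own line; that convention and the helper names are assumptions, and lines that contain a colon for other reasons (a time such as 10:00, for instance) would be misread as turns.

```python
import re
from collections import Counter

# A line counts as a speaker turn when it looks like "Label: text".
TURN = re.compile(r"^[ \t]*([^:\n]{1,40}?)[ \t]*:[ \t]*\S", re.MULTILINE)


def speaker_turns(transcript: str) -> Counter:
    """Count how many turns each speaker label has in the transcript."""
    return Counter(match.group(1) for match in TURN.finditer(transcript))


def unexpected_speakers(transcript: str, expected: set[str]) -> set[str]:
    """Labels found in the transcript that have no voice configured."""
    return set(speaker_turns(transcript)) - expected


if __name__ == "__main__":
    sample = (
        "Dr. Anya: Welcome back, reptile fans!\n"
        "Liam: Thanks, great to be here.\n"
        "Dr. Anya: Today, geckos."
    )
    print(speaker_turns(sample))
    print(unexpected_speakers(sample, {"Dr. Anya", "Liam"}))
```

If `unexpected_speakers` returns a non-empty set, one option is to regenerate the transcript with a stricter instruction about speaker labels.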

Voice options

TTS models support the following 30 voice options in the voice_name field:

Zephyr -- Bright	Puck -- Upbeat	Charon -- Informative
Kore -- Firm	Fenrir -- Excitable	Leda -- Youthful
Orus -- Firm	Aoede -- Breezy	Callirrhoe -- Easy-going
Autonoe -- Bright	Enceladus -- Breathy	Iapetus -- Clear
Umbriel -- Easy-going	Algieba -- Smooth	Despina -- Smooth
Erinome -- Clear	Algenib -- Gravelly	Rasalgethi -- Informative
Laomedeia -- Upbeat	Achernar -- Soft	Alnilam -- Firm
Schedar -- Even	Gacrux -- Mature	Pulcherrima -- Forward
Achird -- Friendly	Zubenelgenubi -- Casual	Vindemiatrix -- Gentle
Sadachbia -- Lively	Sadaltager -- Knowledgeable	Sulafat -- Warm

You can hear all the voice options in AI Studio.
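
A mistyped voice name is easy to catch locally. The sketch below copies the 30 voices and their descriptors from the table above into a dictionary and offers two hypothetical helpers: one that resolves a name regardless of letter case, and one that lists voices by descriptor. It says nothing about how the API itself treats letter case; it only returns the spelling used in the table.

```python
VOICES = {
    "Zephyr": "Bright", "Puck": "Upbeat", "Charon": "Informative",
    "Kore": "Firm", "Fenrir": "Excitable", "Leda": "Youthful",
    "Orus": "Firm", "Aoede": "Breezy", "Callirrhoe": "Easy-going",
    "Autonoe": "Bright", "Enceladus": "Breathy", "Iapetus": "Clear",
    "Umbriel": "Easy-going", "Algieba": "Smooth", "Despina": "Smooth",
    "Erinome": "Clear", "Algenib": "Gravelly", "Rasalgethi": "Informative",
    "Laomedeia": "Upbeat", "Achernar": "Soft", "Alnilam": "Firm",
    "Schedar": "Even", "Gacrux": "Mature", "Pulcherrima": "Forward",
    "Achird": "Friendly", "Zubenelgenubi": "Casual", "Vindemiatrix": "Gentle",
    "Sadachbia": "Lively", "Sadaltager": "Knowledgeable", "Sulafat": "Warm",
}


def resolve_voice(name: str) -> str:
    """Return the table spelling of a voice, matching case-insensitively."""
    lookup = {voice.lower(): voice for voice in VOICES}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown voice {name!r}") from None


def voices_with_style(style: str) -> list[str]:
    """List voices whose descriptor matches `style`, in alphabetical order."""
    wanted = style.strip().lower()
    return sorted(v for v, s in VOICES.items() if s.lower() == wanted)


if __name__ == "__main__":
    print(resolve_voice("kore"))
    print(voices_with_style("Firm"))
```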

Supported languages

The TTS models detect the input language automatically. They support the following 24 languages:

Language	BCP-47 code	Language	BCP-47 code
Arabic (Egyptian)	`ar-EG`	German (Germany)	`de-DE`
English (US)	`en-US`	Spanish (US)	`es-US`
French (France)	`fr-FR`	Hindi (India)	`hi-IN`
Indonesian (Indonesia)	`id-ID`	Italian (Italy)	`it-IT`
Japanese (Japan)	`ja-JP`	Korean (Korea)	`ko-KR`
Portuguese (Brazil)	`pt-BR`	Russian (Russia)	`ru-RU`
Dutch (Netherlands)	`nl-NL`	Polish (Poland)	`pl-PL`
Thai (Thailand)	`th-TH`	Turkish (Turkey)	`tr-TR`
Vietnamese (Vietnam)	`vi-VN`	Romanian (Romania)	`ro-RO`
Ukrainian (Ukraine)	`uk-UA`	Bengali (Bangladesh)	`bn-BD`
English (India)	`en-IN` & `hi-IN` bundle	Marathi (India)	`mr-IN`
Tamil (India)	`ta-IN`	Telugu (India)	`te-IN`

Supported models

Model	Single speaker	Multi-speaker
Gemini 2.5 Flash Preview TTS	✔️	✔️
Gemini 2.5 Pro Preview TTS	✔️	✔️

Limitations

TTS models can only receive text inputs and generate audio outputs.
A TTS session has a context window limit of 32,000 tokens.
See the Supported languages section for language support.
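
For scripts that might not fit in the context window, one possible approach is to split the text at paragraph boundaries and synthesize each chunk separately. The sketch below is only an illustration: `rough_token_estimate` assumes roughly four characters per token, which is a guess rather than the model's real tokenizer, and this guide does not say how the limit is shared between input text and generated audio, so a conservative budget is advisable. Pass a real token counter as `count` when one is available.

```python
from typing import Callable

CONTEXT_LIMIT = 32_000  # tokens, from the limitations above


def rough_token_estimate(text: str) -> int:
    """Crude stand-in for a tokenizer: ASSUMES about 4 characters per token."""
    return max(1, len(text) // 4)


def split_script(
    script: str,
    budget: int = CONTEXT_LIMIT,
    count: Callable[[str], int] = rough_token_estimate,
) -> list[str]:
    """Greedily pack paragraphs into chunks whose counted size fits the budget.

    A single paragraph larger than the budget is returned as its own chunk,
    so callers still need to handle that case.
    """
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in (part.strip() for part in script.split("\n\n")):
        if not paragraph:
            continue
        candidate = "\n\n".join(current + [paragraph])
        if current and count(candidate) > budget:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append("\n\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    demo = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for chunk in split_script(demo, budget=8):
        print(repr(chunk))
```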

Prompting guide

The Gemini Native Audio Generation Text-to-Speech (TTS) model differs from traditional TTS models in that it is built on a large language model (LLM) that knows not only what to say, but also how to say it.

To unlock this capability, you can think of yourself as a director setting a scene for a virtual voice talent. To craft a prompt, we recommend the following components: an audio profile that defines the character's core identity and archetype, a scene description that establishes the physical environment and emotional mood, and director's notes that offer more precise guidance on style, accent, and pace control.

You can use the model to produce dynamic, natural, and expressive audio performances by providing detailed instructions such as a precise regional accent, specific paralinguistic features (for example, breathiness), or pacing. For best results, the direction should be aligned with the transcript, so that "who is saying it" matches "what is said" and "how it is being said."

The purpose of this guide is to offer basic direction and spark ideas for developing audio experiences with Gemini TTS audio generation. We're excited to see what you create!

Prompt structure

A robust prompt ideally includes the following elements, which come together to craft a great performance:

Audio profile: Establishes a persona for the voice, including the character's identity, archetype, and other traits such as age and background.
Scene: Sets the stage. Describes the physical environment and the mood.
Director's notes: Performance guidance where you spell out which instructions matter for your virtual talent. Examples: style, breathing, pacing, articulation, and accent.
Sample context: Gives the model a contextual starting point, so that your virtual actor enters the scene you set up naturally.
Transcript: The text the model will speak. For best performance, remember that the transcript topic and writing style should correlate with the directions you give.

Example full prompt:

# AUDIO PROFILE: Jaz R.
## "The Morning Hype"

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline,
but inside, it is blindingly bright. The red "ON AIR" tally light is blazing.
Jaz is standing up, not sitting, bouncing on the balls of their heels to the
rhythm of a thumping backing track. Their hands fly across the faders on a
massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake
up an entire nation.

### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is
always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated
vowels on excitement words (e.g., "Beauuutiful morning").

Pace: Speaks at an energetic pace, keeping up with the fast music. Speaks
with a "bouncing" cadence. High-speed delivery with fluid transitions: no dead
air, no gaps.

Accent: Jaz is from Brixton, London

### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any
script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT
Yes, massive vibes in the studio! You are locked in and it is absolutely
popping off in London right now. If you're stuck on the tube, or just sat
there pretending to work... stop it. Seriously, I see you. Turn this up!
We've got the project roadmap landing in three, two... let's go!
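
When prompts are assembled in code, the layout of the example above can be captured in a small template. The following is a minimal sketch, not an official format: the `TtsPrompt` class and its field names are hypothetical, and the heading levels simply mirror the example prompt. Sections left empty are omitted, and only the transcript is required here.

```python
from dataclasses import dataclass


@dataclass
class TtsPrompt:
    transcript: str            # the text to be spoken (required)
    name: str = ""             # character name, e.g. "Jaz R."
    role: str = ""             # archetype, e.g. "The Morning Hype"
    scene_title: str = ""
    scene: str = ""
    director_notes: str = ""
    sample_context: str = ""

    def render(self) -> str:
        """Join the non-empty sections in the order used by the example."""
        sections = []
        if self.name:
            profile = f"# AUDIO PROFILE: {self.name}"
            if self.role:
                profile += f'\n## "{self.role}"'
            sections.append(profile)
        if self.scene:
            title = f": {self.scene_title}" if self.scene_title else ""
            sections.append(f"## THE SCENE{title}\n{self.scene}")
        if self.director_notes:
            sections.append(f"### DIRECTOR'S NOTES\n{self.director_notes}")
        if self.sample_context:
            sections.append(f"### SAMPLE CONTEXT\n{self.sample_context}")
        sections.append(f"#### TRANSCRIPT\n{self.transcript}")
        return "\n\n".join(sections)


if __name__ == "__main__":
    prompt = TtsPrompt(
        transcript="Yes, massive vibes in the studio!",
        name="Jaz R.",
        role="The Morning Hype",
        director_notes="Accent: Jaz is from Brixton, London",
    )
    print(prompt.render())
```

The rendered string can then be passed as the `contents` of a TTS request, as in the earlier examples.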

Detailed prompting strategies

Let's break down each element of the prompt.

Audio profile

Briefly describe the persona of the character.

Name. Giving your character a name helps ground the model and improves performance. Refer to the character by name when setting the scene and context.
Role. The core identity and archetype of the character playing out in the scene, for example a radio DJ, podcaster, or news reporter.

Examples:

# AUDIO PROFILE: Jaz R.
## "The Morning Hype"

# AUDIO PROFILE: Monica A.
## "The Beauty Influencer"

Scene

Set the context for the scene, including the location, mood, and environmental details that establish the tone and vibe. Describe what is happening around the character and how it affects them. The scene provides the environmental context for the whole interaction and guides the acting performance in a subtle, organic way.

Examples:

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline,
but inside, it is blindingly bright. The red "ON AIR" tally light is blazing.
Jaz is standing up, not sitting, bouncing on the balls of their heels to the
rhythm of a thumping backing track. Their hands fly across the faders on a
massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to
wake up an entire nation.

## THE SCENE: Homegrown Studio
A meticulously sound-treated bedroom in a suburban home. The space is
deadened by plush velvet curtains and a heavy rug, but there is a
distinct "proximity effect."

Director's notes

This critical section contains specific performance guidance. You can skip all the other elements, but we recommend that you include this one.

Define only what matters to the performance, and be careful not to overspecify. Too many strict rules limit the model's creativity and may result in a worse performance. Balance the role and scene description against the specific performance rules.

The most common directions are style, pacing, and accent, but the model is neither limited to these nor requires them. Feel free to add custom instructions covering any additional details important to your performance, in as much or as little detail as necessary.

For example:

### DIRECTOR'S NOTES

Style: Enthusiastic and Sassy GenZ beauty YouTuber

Pacing: Speaks at an energetic pace, keeping up with the extremely fast, rapid
delivery influencers use in short form videos.

Accent: Southern California valley girl from Laguna Beach
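
Since none of the common directions is mandatory and custom ones are allowed, a notes block can be generated from whichever directions are supplied. This is a small sketch under that reading; `director_notes` is a hypothetical helper, and turning extra keyword names into labels (for example `breathing` into `Breathing`) is just one possible convention.

```python
def director_notes(style: str = "", pacing: str = "", accent: str = "", **extra: str) -> str:
    """Format only the directions that were actually provided."""
    fields = {"Style": style, "Pacing": pacing, "Accent": accent}
    # Custom directions: keyword `vocal_fry` becomes the label "Vocal Fry".
    fields.update({key.replace("_", " ").title(): text for key, text in extra.items()})
    lines = [f"{label}: {text}" for label, text in fields.items() if text]
    if not lines:
        raise ValueError("provide at least one direction")
    return "### DIRECTOR'S NOTES\n\n" + "\n\n".join(lines)


if __name__ == "__main__":
    print(
        director_notes(
            style="Enthusiastic and Sassy GenZ beauty YouTuber",
            accent="Southern California valley girl from Laguna Beach",
            breathing="Light, audible breaths between sentences",
        )
    )
```

Keeping the function tolerant of missing fields reflects the advice above to specify only what matters to the performance.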

Style:

Sets the tone and style of the generated speech. Include descriptors such as upbeat, energetic, relaxed, or bored to guide the performance. Be descriptive and provide as much detail as necessary: "Infectious enthusiasm. The listener should feel like they are part of a massive, exciting community event." works better than simply saying "energetic and enthusiastic".

You can also try terms that are popular in the voiceover industry, such as "vocal smile". You can layer as many style characteristics as you want.

Examples:

Simple Emotion

DIRECTORS NOTES
...
Style: Frustrated and angry developer who can't get the build to run.
...

More depth

DIRECTORS NOTES
...
Style: Sassy GenZ beauty YouTuber, who mostly creates content for YouTube Shorts.
...

Complex

DIRECTORS NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is
always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and
elongated vowels on excitement words (e.g., "Beauuutiful morning").

Accent:

Describe the desired accent. The more specific you are, the better the results. For example, use "British English accent as heard in Croydon, England" rather than "British accent".

Examples:

### DIRECTORS NOTES
...
Accent: Southern California valley girl from Laguna Beach
...

### DIRECTORS NOTES
...
Accent: Jaz is from Brixton, London
...

Pacing:

The overall pacing and the variation in pace throughout the piece.

Examples:

Simple

### DIRECTORS NOTES
...
Pacing: Speak as fast as possible
...

More depth

### DIRECTORS NOTES
...
Pacing: Speaks at a faster, energetic pace, keeping up with fast paced music.
...

Complex

### DIRECTORS NOTES
...
Pacing: The "Drift": The tempo is incredibly slow and liquid. Words bleed into each other. There is zero urgency.
...

Give it a try

Try out some of these examples yourself in AI Studio, play with our TTS App, and let Gemini put you in the director's chair. Keep these tips in mind to produce great vocal performances:

Remember to keep the whole prompt coherent: the script and the direction go hand in hand in creating a great performance.
Don't feel you have to describe everything. Sometimes giving the model room to fill in the gaps helps naturalness, just as it does with a talented actor.
If you ever feel stuck, ask Gemini to help you craft your script or performance.

What's next

Try the audio generation cookbook.
The Gemini Live API offers interactive audio generation options that you can interleave with other modalities.
For working with audio inputs, see the Audio understanding guide.