Audio understanding

Starting with Gemma 3n, you can use audio directly in your prompts and workflows. Audio and spoken language are rich sources of data for capturing user intent, recording information about the world around us, and understanding the specific problems that need to be solved.

This guide gives an overview of the audio processing capabilities of Gemma 4, including automatic speech recognition (ASR), translation, and general speech understanding.

This notebook runs on a T4 GPU.

Install Python packages

Install the Hugging Face libraries required to run the Gemma model and send requests.

# Install PyTorch & other libraries
pip install torch accelerate

# Install the transformers library
pip install "transformers>=5.10.1"

Load model

Use the transformers library to load the model. The following code example uses the high-level pipeline API with the any-to-any task, which sets up the processor and the model in one call instead of instantiating them separately with classes such as AutoProcessor and AutoModelForImageTextToText:

MODEL_ID = "google/gemma-4-E2B-it" # @param ["google/gemma-4-E2B-it","google/gemma-4-E4B-it", "google/gemma-4-12B-it"]

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto"
)

Audio data

Digital audio data comes in a variety of formats and resolutions. The audio file formats you can use with Gemma, such as MP3 and WAV, are determined by the framework you choose for converting audio data into tensors. Here are some specific considerations for preparing audio data for processing with Gemma:

Token cost: Each second of audio costs 25 tokens for Gemma 4 (6.25 tokens for Gemma 3n).
Clip length: Audio clips of up to 30 seconds are supported.
Audio channels: Audio data is processed as a single audio channel. If you are using multi-channel audio, such as left and right channels, consider reducing the data to a single channel by dropping channels or by combining the audio data into one channel.
Technical encoding:
- Sample rate: 16 kHz
- Bit depth: 32-bit float, with samples normalized to the range [-1, 1].
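
The token cost and clip length listed above reduce to simple arithmetic. The following is a minimal sketch for budgeting audio tokens before sending a prompt. The function name, the dictionary keys, and the choice to round up to a whole token are assumptions made for illustration; they are not part of any Gemma or Transformers API.

```python
import math

# Per-second token costs and the clip limit as stated in the list above.
TOKENS_PER_SECOND = {"gemma-4": 25.0, "gemma-3n": 6.25}
MAX_CLIP_SECONDS = 30.0


def estimate_audio_tokens(duration_seconds, model_family="gemma-4"):
    """Estimate how many tokens one audio clip adds to a prompt.

    Rounding up to a whole token is an assumption of this sketch.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    if duration_seconds > MAX_CLIP_SECONDS:
        raise ValueError("clip is longer than the 30 second limit")
    return math.ceil(duration_seconds * TOKENS_PER_SECOND[model_family])


print(estimate_audio_tokens(30))              # 750
print(estimate_audio_tokens(30, "gemma-3n"))  # 188
```

A full 30 second clip therefore costs 750 tokens with Gemma 4, which is worth keeping in mind when a prompt contains several clips.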

If the audio data you plan to process differs significantly from the input the model processes, particularly in terms of channels, sample rate, and bit depth, consider resampling or trimming your audio data to match the resolution of the data handled by the model.
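
To decide whether a file needs this kind of preparation, you can inspect its header first. The sketch below uses Python's standard wave module, which handles integer PCM WAV files, so samples read this way still need conversion to float32 afterwards. The helper names and the demo file are hypothetical and only serve to illustrate the check.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE_HZ = 16_000
MAX_CLIP_SECONDS = 30.0


def describe_wav(path):
    """Read channel count, sample rate, bit depth, and duration from a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate_hz": sample_rate,
            "bit_depth": wav_file.getsampwidth() * 8,
            "duration_seconds": wav_file.getnframes() / sample_rate,
        }


def preprocessing_steps(info):
    """List the adjustments needed to match mono, 16 kHz, 30 seconds or less."""
    steps = []
    if info["channels"] != 1:
        steps.append("downmix to mono")
    if info["sample_rate_hz"] != TARGET_SAMPLE_RATE_HZ:
        steps.append("resample to 16 kHz")
    if info["duration_seconds"] > MAX_CLIP_SECONDS:
        steps.append("trim or split to 30 seconds or less")
    return steps


# Demo: write one second of silent 44.1 kHz stereo 16-bit PCM, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(2)
    wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
    wav_file.setframerate(44_100)
    wav_file.writeframes(b"\x00" * (44_100 * 2 * 2))

info = describe_wav(demo_path)
steps = preprocessing_steps(info)
print(info)
print(steps)
```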

Audio encoding

Although high-level libraries (such as the Hugging Face AutoProcessor) often handle audio preprocessing automatically, you may sometimes need to implement custom encoding.

When you encode audio data with your own code for use with Gemma, follow the recommended conversion process. If you are working with audio files encoded in a specific format, such as MP3 or WAV, first decode them into samples using a library such as ffmpeg. Once the data is decoded, convert the audio to single-channel, 16 kHz float32 waveforms in the range [-1, 1]. For example, if you are working with stereo, signed 16-bit PCM integer WAV files at 44.1 kHz, follow these steps:

Resample the audio data to 16 kHz.
Downmix from stereo to mono by averaging the two channels.
Convert from int16 to float32 and divide by 32768.0 to bring the samples into the range [-1, 1].

Note: When resampling audio to 16 kHz, use a Fourier method such as scipy.signal.resample or librosa.resample(res_type='scipy') for best results.
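
The three steps above can be sketched with NumPy and SciPy as follows. This is an illustration under stated assumptions, not the preprocessing that the Transformers processor itself performs: the function name is hypothetical, the input is assumed to be an already decoded int16 array of shape (num_frames, 2), and the final clip is an extra safeguard added here because Fourier resampling can overshoot slightly.

```python
import numpy as np
from scipy.signal import resample

TARGET_SAMPLE_RATE_HZ = 16_000


def pcm16_stereo_to_mono_float32(pcm, source_rate_hz):
    """Convert stereo int16 PCM of shape (num_frames, 2) to mono float32 at 16 kHz."""
    # Step 1: resample to 16 kHz with the Fourier method (scipy.signal.resample).
    num_target = int(round(pcm.shape[0] * TARGET_SAMPLE_RATE_HZ / source_rate_hz))
    resampled = resample(pcm.astype(np.float64), num_target, axis=0)
    # Step 2: downmix to mono by averaging the two channels.
    mono = resampled.mean(axis=1)
    # Step 3: divide by 32768.0 to reach [-1, 1], then store as float32.
    # Clipping is an assumption of this sketch, guarding against resampling overshoot.
    return np.clip(mono / 32768.0, -1.0, 1.0).astype(np.float32)


# Demo input: one second of a 440 Hz tone at half of full scale, 44.1 kHz stereo.
source_rate_hz = 44_100
t = np.arange(source_rate_hz) / source_rate_hz
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440.0 * t)).astype(np.int16)
stereo = np.stack([tone, tone], axis=1)

waveform = pcm16_stereo_to_mono_float32(stereo, source_rate_hz)
print(waveform.shape, waveform.dtype)
```

Because all three steps are linear operations, doing the arithmetic in floating point before the final cast avoids integer overflow when the channels are averaged.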

Speech-to-text

The Gemma 4 E2B, E4B, and 12B Unified models are trained for multilingual speech recognition, which lets you convert audio input in various languages to text.

Use the following prompt structure for automatic speech recognition (ASR).

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
*   Only output the transcription, with no newlines.
*   When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
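
One way to fill in the {LANGUAGE} placeholder is a small helper that formats the template and wraps it in the chat message structure used in this guide. The helper name and the lowercase placeholder name are assumptions for illustration; the prompt wording is copied from the structure above.

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)


def build_asr_messages(language, audio_source):
    """Return a single-turn chat message list with the ASR prompt and one audio clip."""
    prompt = ASR_TEMPLATE.format(language=language)
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio_source},  # file path or URL
            ],
        }
    ]


asr_messages = build_asr_messages("Spanish", "/content/recording.wav")
print(asr_messages[0]["content"][0]["text"])
```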

The following code samples show how to prompt the model to transcribe text from audio files using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            #{"type": "text", "text": "Transcribe the following speech segment in English into English text. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

I woke up early today feeling really fresh the morning light was beautiful and I enjoyed a nice cup of coffee<turn|>

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 1024
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Give me a concise overview of these audio files."},
            {"type": "text", "text": "journal1:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
            {"type": "text", "text": "journal2:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal2.wav"},
            {"type": "text", "text": "journal3:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal3.wav"},
            {"type": "text", "text": "journal4:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal4.wav"},
            {"type": "text", "text": "journal5:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal5.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Here is a concise overview of each audio file:

**journal1:** The speaker describes a fresh and peaceful day, enjoying a cup of coffee.
**journal2:** The speaker had a perfect day at the park, including a walk and watching cherry blossoms.
**journal3:** The speaker finished the day with a good book, feeling grateful for simple moments.
**journal4:** The speaker returned from work and noted the beautiful night sky and a clear view from the train.
**journal5:** The speaker had a great lunch with an old friend, which was a pleasant way to catch up and made their day.
<turn|>
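
The labelled list of clips in the request above follows a regular pattern, so it can also be generated in a loop. The following sketch builds the same content structure; the helper name and its parameters are hypothetical, and the URL prefix is the one already used in this guide.

```python
def build_labelled_audio_content(instruction, url_prefix, clip_names, extension=".wav"):
    """Interleave a text label and an audio entry for each clip, after one instruction."""
    content = [{"type": "text", "text": instruction}]
    for name in clip_names:
        content.append({"type": "text", "text": f"{name}:"})
        content.append({"type": "audio", "audio": f"{url_prefix}{name}{extension}"})
    return content


RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"
clip_names = [f"journal{i}" for i in range(1, 6)]  # journal1 ... journal5

overview_messages = [
    {
        "role": "user",
        "content": build_labelled_audio_content(
            "Give me a concise overview of these audio files.",
            RESOURCE_URL_PREFIX,
            clip_names,
        ),
    }
]
print(len(overview_messages[0]["content"]))  # 11
```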

Automatic speech translation

The Gemma 4 E2B, E4B, and 12B Unified models are trained for multilingual speech translation tasks, which lets you translate spoken audio directly into another language.

Use the following prompt structure for automatic speech translation (AST).

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
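
As with ASR, the two placeholders can be filled in by a small formatting helper. This is a sketch only; the function name and the lowercase placeholder names are assumptions, while the prompt wording is copied from the structure above.

```python
AST_TEMPLATE = (
    "Transcribe the following speech segment in {source_language}, "
    "then translate it into {target_language}.\n"
    "When formatting the answer, first output the transcription in {source_language}, "
    "then one newline, then output the string '{target_language}: ', "
    "then the translation in {target_language}."
)


def build_ast_prompt(source_language, target_language):
    """Fill the AST prompt structure with a source and a target language."""
    return AST_TEMPLATE.format(
        source_language=source_language,
        target_language=target_language,
    )


ast_prompt = build_ast_prompt("English", "Korean")
print(ast_prompt)
```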

The following code samples show how to prompt the model to translate spoken audio into text using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Roses are red, violets are blue.
Korean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>
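
Because the prompt fixes the output layout (transcription, one newline, then the target language label), the response can be split back into its two parts. The parser below is a minimal sketch: the function name is hypothetical, and it assumes the response ends with the <turn|> marker seen in the output above and follows the requested layout exactly.

```python
def parse_ast_output(text, target_language, end_marker="<turn|>"):
    """Split an AST response into (transcription, translation)."""
    cleaned = text.strip()
    # Drop the trailing end-of-turn marker if the model emitted it.
    if cleaned.endswith(end_marker):
        cleaned = cleaned[: -len(end_marker)].rstrip()
    label = f"{target_language}: "
    transcription, separator, rest = cleaned.partition("\n")
    if not separator or not rest.startswith(label):
        raise ValueError("response does not follow the requested AST layout")
    return transcription.strip(), rest[len(label):].strip()


sample = "Roses are red, violets are blue.\nKorean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>"
transcription, translation = parse_ast_output(sample, "Korean")
print(transcription)
print(translation)
```

Raising an error on an unexpected layout, rather than guessing, makes it easier to notice when the model did not follow the formatting instructions.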

Automatic speech translation / automatic speech recognition

Try it yourself

pip install ipywebrtc

Press the circle button and start speaking. Click the circle button again when you have finished. The widget immediately starts playing back what it recorded.

from google.colab import output
output.enable_custom_widget_manager()

from ipywebrtc import AudioRecorder, CameraStream

camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

Convert the webm file to wav format so that PyTorch can read it.

with open('/content/recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i /content/recording.webm /content/recording.wav -y

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
Input #0, matroska,webm, from '/content/recording.webm':
  Metadata:
    encoder         : Chrome
  Duration: 00:00:03.00, start: 0.000000, bitrate: 132 kb/s
  Stream #0:0(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Stream mapping:
  Stream #0:0 -> #0:0 (opus (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/content/recording.wav':
  Metadata:
    ISFT            : Lavf58.76.100
  Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
    Metadata:
      encoder         : Lavc58.134.100 pcm_s16le
size=     287kB time=00:00:02.99 bitrate= 783.7kbits/s speed=79.4x    
video:0kB audio:287kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.026552%
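
The log above shows that the converted file keeps the recording's 48000 Hz sample rate. If you want the file on disk to already match the 16 kHz mono format described earlier, ffmpeg's -ar (sample rate) and -ac (channel count) options can be added to the conversion. The sketch below builds that command from Python; the helper names are hypothetical, and whether the extra conversion is needed depends on how much preprocessing your loading code already performs.

```python
import subprocess


def build_ffmpeg_command(input_path, output_path, sample_rate_hz=16_000, channels=1):
    """Build an ffmpeg command that converts audio to the given rate and channel count."""
    return [
        "ffmpeg",
        "-y",                        # overwrite the output file if it exists
        "-i", input_path,            # input file
        "-ar", str(sample_rate_hz),  # output sample rate in Hz
        "-ac", str(channels),        # number of output channels
        output_path,
    ]


def convert_recording(input_path, output_path):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(input_path, output_path), check=True)


command = build_ffmpeg_command("/content/recording.webm", "/content/recording.wav")
print(" ".join(command))
```

Calling convert_recording("/content/recording.webm", "/content/recording.wav") would then replace the plain ffmpeg call used above.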

ASR

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?<turn|>

AST

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?
Korean: 역에 어떻게 가나요?<turn|>

Summary and next steps

In this guide, you learned how to process audio with Gemma 4 models. The examples showed how to perform speech-to-text (ASR) to transcribe spoken language, and automatic speech translation (AST) to translate spoken audio directly into another language. You also saw how to record audio from a microphone inside a notebook environment for processing.

See the following documentation for further reading.