Audio understanding

Starting with Gemma 3n, you can use audio directly in your prompts and workflows. Audio and spoken language are rich sources of data for identifying user intent, capturing information about the world around us, and understanding the specific problems that need solving.

This guide gives an overview of the audio processing capabilities of Gemma 4, including automatic speech recognition (ASR), speech translation, and general speech understanding.

This notebook runs on a T4 GPU.

Install Python packages

Install the Hugging Face libraries required to run the Gemma model and make requests.

# Install PyTorch & other libraries
pip install torch accelerate

# Install the transformers library
pip install "transformers>=5.10.1"

Load the model

Use the transformers library to load the model. The following code example uses the pipeline API with the any-to-any task, which creates the processor and model for you instead of instantiating the AutoProcessor and AutoModelForImageTextToText classes directly:

MODEL_ID = "google/gemma-4-E2B-it" # @param ["google/gemma-4-E2B-it","google/gemma-4-E4B-it", "google/gemma-4-12B-it"]

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto"
)

Audio data

Digital audio data comes in many formats and levels of resolution. The audio formats you can actually use with Gemma, such as MP3 and WAV, are determined by the framework you choose to convert audio data into tensors. Keep the following considerations in mind when preparing audio data for Gemma:

Token cost: each second of audio costs 25 tokens with Gemma 4 (6.25 tokens with Gemma 3n).
Clip length: a clip can be at most 30 seconds long.
Audio channels: audio data is processed as a single channel. If you have multi-channel audio, such as left and right stereo channels, reduce it to one channel by dropping channels or by combining them into a single channel.
Technical encoding:
- Sample rate: 16 kHz
- Bit depth: 32-bit floating point, with samples normalized to the range [-1, 1]
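The token cost and clip length figures above translate into simple arithmetic. The following is a minimal sketch, not part of the official API: the function names are hypothetical, and rounding the token estimate up to a whole number is an assumption.

```python
import math

# Figures quoted in the list above.
TOKENS_PER_SECOND = 25   # Gemma 4 (Gemma 3n: 6.25)
MAX_CLIP_SECONDS = 30    # maximum audio length per clip


def audio_token_cost(duration_s: float, tokens_per_second: float = TOKENS_PER_SECOND) -> int:
    """Estimate the tokens used by a clip (rounding up is an assumption)."""
    return math.ceil(duration_s * tokens_per_second)


def split_into_clips(duration_s: float, max_clip_s: float = MAX_CLIP_SECONDS) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds covering the audio in clips of at most max_clip_s."""
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_clip_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


if __name__ == "__main__":
    print(audio_token_cost(30))      # tokens for one full-length clip
    print(split_into_clips(75.0))    # clip boundaries for a 75-second recording
```

Splitting at fixed offsets can cut a word in half, so in practice you may prefer to split at pauses in the speech.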

If the audio data you plan to process differs significantly from this input format, particularly in channels, sample rate, or bit depth, resample or trim it to match the resolution the model processes.
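One way to find out whether a file needs conversion is to read its header before sending it to the model. The sketch below uses Python's standard wave module, which handles PCM WAV files only. The function name and the returned keys are hypothetical, and the target values come from the considerations listed above.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # Hz
TARGET_CHANNELS = 1          # mono
MAX_CLIP_SECONDS = 30


def describe_wav(path: str) -> dict:
    """Read the header of a PCM WAV file and flag what differs from the target format."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        bits_per_sample = wav.getsampwidth() * 8
        duration_s = wav.getnframes() / sample_rate
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "bits_per_sample": bits_per_sample,
        "duration_s": duration_s,
        "needs_downmix": channels != TARGET_CHANNELS,
        "needs_resample": sample_rate != TARGET_SAMPLE_RATE,
        "needs_trim": duration_s > MAX_CLIP_SECONDS,
    }
```

For other containers such as MP3 or webm, a tool like ffprobe or an audio library would be needed instead.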

Audio encoding

High-level libraries such as Hugging Face's AutoProcessor often handle audio preprocessing automatically, but you may sometimes need to implement custom encoding.

When you encode audio data for Gemma with your own code, follow the recommended conversion process. If your audio files are stored in an encoded format, such as MP3 or WAV, first decode them into samples with a tool or library such as ffmpeg. After decoding, convert the audio to mono float32 waveforms at 16 kHz in the range [-1, 1]. For example, if you are working with stereo WAV files that contain 16-bit integer PCM at 44.1 kHz, follow these steps:

Resample the audio data to 16 kHz.
Downmix from stereo to mono by averaging the two channels.
Convert from int16 to float32 and divide by 32768.0 to rescale to the range [-1, 1].

Note: When resampling audio to 16 kHz, use a Fourier method for best results, such as scipy.signal.resample or librosa.resample(res_type='scipy').
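The three steps above can be sketched with NumPy and scipy.signal.resample. This is an illustrative sketch rather than a reference implementation: the function name is hypothetical, the input is assumed to be already decoded into an int16 array of shape (num_frames, num_channels), and the final clip is an added safeguard that the steps above do not mention.

```python
import numpy as np
from scipy.signal import resample

TARGET_SAMPLE_RATE = 16_000  # Hz


def pcm16_to_model_input(samples: np.ndarray, orig_rate: int) -> np.ndarray:
    """Convert int16 PCM of shape (num_frames, num_channels) to mono float32 at 16 kHz."""
    # Step 1: resample to 16 kHz with the Fourier method, along the time axis.
    num_out = int(round(samples.shape[0] * TARGET_SAMPLE_RATE / orig_rate))
    resampled = resample(samples.astype(np.float64), num_out, axis=0)
    # Step 2: downmix to mono by averaging the channels.
    mono = resampled.mean(axis=1)
    # Step 3: rescale from the int16 range to [-1, 1] and store as float32.
    scaled = mono / 32768.0
    # Fourier resampling can overshoot slightly near sharp edges, so clip (assumption).
    return np.clip(scaled, -1.0, 1.0).astype(np.float32)


if __name__ == "__main__":
    # One second of a 440 Hz tone as synthetic 44.1 kHz stereo int16 data.
    t = np.arange(44_100) / 44_100
    tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
    stereo = np.stack([tone, tone], axis=1)
    out = pcm16_to_model_input(stereo, 44_100)
    print(out.shape, out.dtype)
```

All three steps are linear operations, so downmixing before resampling gives the same result up to floating-point error while halving the resampling work.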

Speech to text

The Gemma 4 E2B, E4B, and 12B Unified models are trained to recognize speech in multiple languages, so you can convert audio input in different languages into text.

Use the following prompt structure for automatic speech recognition (ASR).

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
*   Only output the transcription, with no newlines.
*   When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
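If you transcribe several languages, it can help to fill the {LANGUAGE} placeholder programmatically. The helper below is a sketch with a hypothetical function name. It reproduces the prompt structure above and wraps it in the chat message format used by the pipeline examples in this guide. The file name in the usage line is a placeholder.

```python
ASR_PROMPT_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, "
    "and write 3 instead of three."
)


def build_asr_messages(audio: str, language: str) -> list[dict]:
    """Build a single-turn chat message with the ASR prompt followed by one audio item."""
    prompt = ASR_PROMPT_TEMPLATE.format(language=language)
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio},  # local path or URL
            ],
        }
    ]


if __name__ == "__main__":
    messages = build_asr_messages("speech.wav", "English")  # placeholder file name
    print(messages[0]["content"][0]["text"])
```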

The following code examples show how to prompt the model to transcribe audio files to text using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            #{"type": "text", "text": "Transcribe the following speech segment in English into English text. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

I woke up early today feeling really fresh the morning light was beautiful and I enjoyed a nice cup of coffee<turn|>

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 1024
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Give me a concise overview of these audio files."},
            {"type": "text", "text": "journal1:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
            {"type": "text", "text": "journal2:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal2.wav"},
            {"type": "text", "text": "journal3:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal3.wav"},
            {"type": "text", "text": "journal4:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal4.wav"},
            {"type": "text", "text": "journal5:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal5.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Here is a concise overview of each audio file:

**journal1:** The speaker describes a fresh and peaceful day, enjoying a cup of coffee.
**journal2:** The speaker had a perfect day at the park, including a walk and watching cherry blossoms.
**journal3:** The speaker finished the day with a good book, feeling grateful for simple moments.
**journal4:** The speaker returned from work and noted the beautiful night sky and a clear view from the train.
**journal5:** The speaker had a great lunch with an old friend, which was a pleasant way to catch up and made their day.
<turn|>

Automatic speech translation

The Gemma 4 E2B, E4B, and 12B Unified models are trained on multilingual speech translation tasks, so you can translate spoken audio directly into another language.

Use the following prompt structure for automatic speech translation (AST).

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
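The two placeholders in this structure can be filled with a small helper. This is a sketch with a hypothetical function name. It keeps the line break between the two sentences as shown above, although a single space also appears to work, as in the code example that follows.

```python
AST_PROMPT_TEMPLATE = (
    "Transcribe the following speech segment in {source}, then translate it into {target}.\n"
    "When formatting the answer, first output the transcription in {source}, then one newline, "
    "then output the string '{target}: ', then the translation in {target}."
)


def build_ast_prompt(source_language: str, target_language: str) -> str:
    """Fill the AST prompt structure with a source and a target language."""
    return AST_PROMPT_TEMPLATE.format(source=source_language, target=target_language)


if __name__ == "__main__":
    print(build_ast_prompt("English", "Korean"))
```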

The following code example shows how to prompt the model to translate spoken audio into text using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Roses are red, violets are blue.
Korean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>
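Because the prompt fixes the output layout, the transcription and the translation can be separated with plain string handling. The parser below is a sketch with a hypothetical function name. It assumes the model followed the requested layout and that the text may end with the `<turn|>` marker seen in the output above.

```python
END_OF_TURN = "<turn|>"  # marker that appears at the end of the printed outputs in this guide


def parse_ast_output(generated_text: str, target_language: str) -> tuple[str, str]:
    """Split AST output into (transcription, translation)."""
    text = generated_text.strip()
    if text.endswith(END_OF_TURN):
        text = text[: -len(END_OF_TURN)].rstrip()
    # The prompt asks for: transcription, one newline, '<target language>: ', translation.
    marker = f"\n{target_language}: "
    transcription, separator, translation = text.partition(marker)
    if not separator:
        raise ValueError(f"Output does not contain the expected '{target_language}: ' line")
    return transcription.strip(), translation.strip()


if __name__ == "__main__":
    sample = "Roses are red, violets are blue.\nKorean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>"
    print(parse_ast_output(sample, "Korean"))
```

Models do not always follow formatting instructions exactly, so the ValueError branch is worth handling in real code.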

Automatic speech translation / automatic speech recognition

You can try this yourself.

pip install ipywebrtc

Press the circle button and start speaking. Click the circle button again when you are done. The widget then immediately starts playing back what it captured.

from google.colab import output
output.enable_custom_widget_manager()

from ipywebrtc import AudioRecorder, CameraStream

camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

Convert the webm file to the wav format, which PyTorch can read.

with open('/content/recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i /content/recording.webm /content/recording.wav -y

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, matroska,webm, from '/content/recording.webm':
  Metadata:
    encoder         : Chrome
  Duration: 00:00:03.00, start: 0.000000, bitrate: 132 kb/s
  Stream #0:0(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Stream mapping:
  Stream #0:0 -> #0:0 (opus (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/content/recording.wav':
  Metadata:
    ISFT            : Lavf58.76.100
  Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
    Metadata:
      encoder         : Lavc58.134.100 pcm_s16le
size=     287kB time=00:00:02.99 bitrate= 783.7kbits/s speed=79.4x    
video:0kB audio:287kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.026552%
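The log above shows that the converted file keeps the recording's 48000 Hz sample rate. If you want the file itself to match the 16 kHz mono format described under Audio data, ffmpeg can resample and downmix during conversion with its -ar and -ac options. The wrapper below is a sketch with hypothetical function names. Whether this step is needed depends on how your loading code preprocesses audio, which this guide does not specify.

```python
import subprocess


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16_000) -> list[str]:
    """Build an ffmpeg command that writes 16-bit PCM mono WAV at the given sample rate."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input file, for example the webm recording
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample to the target rate
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        dst,
    ]


def convert_recording(src: str, dst: str) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


if __name__ == "__main__":
    print(" ".join(build_ffmpeg_command("/content/recording.webm", "/content/recording.wav")))
```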

ASR

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?<turn|>

Automatic speech translation (AST)

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?
Korean: 역에 어떻게 가나요?<turn|>
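The ASR and AST cells above differ only in the prompt text, so the shared steps can be collected in one function. The sketch below uses a hypothetical function name. It takes the pipeline object and the generation arguments as parameters (pipe and gen_kwargs as created earlier in this guide) and calls the pipeline the same way the cells above do.

```python
def run_audio_prompt(pipe, prompt: str, audio: str, gen_kwargs: dict) -> str:
    """Send one text prompt plus one audio item to the pipeline and return the generated text."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio},  # local path or URL
            ],
        }
    ]
    outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
    return outputs[0]["generated_text"]
```

With the objects from this notebook, a call would look like `run_audio_prompt(pipe, "Transcribe the following speech segment in its original language.", "/content/recording.wav", gen_kwargs)`.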

Summary and next steps

In this guide, you learned how to process audio with Gemma 4 models. The examples showed how to use automatic speech recognition (ASR) to convert spoken language to text and automatic speech translation (AST) to translate spoken audio directly into another language. You also learned how to capture audio from a microphone in a notebook environment for processing.

See the following documentation for more information.