Audio understanding

With Gemma 3n and later, you can use audio directly in your prompts and workflows. Audio and spoken language are rich sources of data for capturing user intent, recording information about the world around us, and understanding specific problems to be solved.

This guide gives an overview of the audio processing capabilities of Gemma 4, including automatic speech recognition (ASR), speech translation, and general speech understanding.

This notebook runs on a T4 GPU.

Install Python packages

Install the Hugging Face libraries required to run the Gemma model and make requests.

# Install PyTorch & other libraries
pip install torch accelerate

# Install the transformers library
pip install "transformers>=5.10.1"

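Load the model

Use the transformers library to load the model. The following code example uses the pipeline function with the any-to-any task, which wraps the processor and the model in a single object. You can also create the processor and model instances separately with the AutoProcessor and AutoModelForImageTextToText classes.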

MODEL_ID = "google/gemma-4-E2B-it" # @param ["google/gemma-4-E2B-it","google/gemma-4-E4B-it", "google/gemma-4-12B-it"]

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto"
)

Audio data

Digital audio data comes in many formats and resolutions. The actual audio formats you can use with Gemma, such as MP3 and WAV, are determined by the framework you choose to convert sound data into tensors. Here are some specific considerations for preparing audio data for processing with Gemma:

Token cost: Each second of audio costs 25 tokens for Gemma 4 (6.25 tokens for Gemma 3n).
Clip length: Audio clips can be up to 30 seconds long.
Audio channels: Audio data is processed as a single channel. If you are using multi-channel audio, such as left and right channels, consider reducing the data to a single channel by removing channels or by combining the sound data into one channel.
Technical encoding:
- Sample rate: 16 kHz
- Bit depth: 32-bit float format, with samples normalized to the range [-1, 1]
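
As a rough illustration of the token cost listed above, the following sketch estimates how many tokens a clip consumes. The helper name and constants are hypothetical and not part of any Gemma API. The per-second rates and the 30-second limit come from the list above, and rounding up to a whole token is an assumption of this sketch.

```python
import math

# Per-second audio token rates quoted in this guide.
TOKENS_PER_SECOND = {"gemma-4": 25.0, "gemma-3n": 6.25}
MAX_CLIP_SECONDS = 30.0  # maximum clip length quoted in this guide


def estimate_audio_tokens(duration_s: float, family: str = "gemma-4") -> int:
    """Estimate the token cost of one clip (hypothetical helper).

    Rounding up to a whole token is an assumption of this sketch.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if duration_s > MAX_CLIP_SECONDS:
        raise ValueError("clip exceeds the 30-second limit")
    return math.ceil(duration_s * TOKENS_PER_SECOND[family])


print(estimate_audio_tokens(30.0))              # 750
print(estimate_audio_tokens(30.0, "gemma-3n"))  # 188
```

Under these assumptions, a full 30-second clip takes about 750 tokens of the context window on Gemma 4.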

If the audio data you plan to process differs significantly from the input format the model expects, particularly in terms of channels, sample rate, and bit depth, consider resampling or trimming your audio data to match the data resolution handled by the model.
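
If a recording is longer than the 30-second limit, one option is to split it into consecutive segments and process each segment separately. The sketch below only computes the sample index ranges for such segments. The helper name is hypothetical, and cutting at fixed boundaries (rather than at pauses in speech) is a simplifying assumption that can split words in half.

```python
def split_into_clips(num_samples: int, sample_rate: int = 16_000,
                     max_seconds: float = 30.0) -> list[tuple[int, int]]:
    """Return (start, end) sample index pairs, each at most max_seconds long."""
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("arguments must be positive (num_samples may be 0)")
    # Number of samples in one full-length clip (at least 1 to keep range() valid).
    clip_len = max(1, int(sample_rate * max_seconds))
    return [(start, min(start + clip_len, num_samples))
            for start in range(0, num_samples, clip_len)]


# A 70-second recording at 16 kHz yields two full clips and a 10-second remainder.
for start, end in split_into_clips(70 * 16_000):
    print(start, end, (end - start) / 16_000)
```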

Audio encoding

Although high-level libraries (such as the Hugging Face AutoProcessor) usually handle audio preprocessing automatically, you may sometimes need custom encoding.

When encoding audio data with your own code for use with Gemma, follow the recommended conversion process. If you are working with audio files encoded in a specific format, such as MP3 or WAV encoded data, you must first decode them into samples using a library such as ffmpeg. Once the data is decoded, convert the audio into single-channel, 16 kHz float32 waveforms in the range [-1, 1]. For example, if you are working with stereo, signed 16-bit PCM integer WAV files at 44.1 kHz, follow these steps:

Resample the audio data to 16 kHz
Downmix from stereo to mono by averaging the 2 channels
Convert from int16 to float32 and divide by 32768.0 to scale to the range [-1, 1]

Note: When resampling audio to 16 kHz, use a Fourier method for best results, such as scipy.signal.resample or librosa.resample(res_type='scipy').
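
The three steps above can be sketched with NumPy and SciPy as follows. This is a minimal illustration, not the preprocessing used by any particular library: the function name is hypothetical, the input is assumed to be already decoded int16 samples shaped (num_samples, num_channels), and the final clipping is an extra safeguard added by this sketch.

```python
import numpy as np
from scipy.signal import resample

TARGET_RATE = 16_000  # sample rate listed in this guide


def pcm16_to_gemma_waveform(pcm: np.ndarray, source_rate: int) -> np.ndarray:
    """Convert decoded int16 PCM to a mono float32 waveform at 16 kHz."""
    if pcm.ndim == 1:
        pcm = pcm[:, np.newaxis]  # treat mono input as a single channel
    # Step 1: Fourier-based resampling along the time axis.
    target_len = int(round(pcm.shape[0] * TARGET_RATE / source_rate))
    resampled = resample(pcm.astype(np.float64), target_len, axis=0)
    # Step 2: downmix to mono by averaging the channels.
    mono = resampled.mean(axis=1)
    # Step 3: scale by 32768.0 and store as float32.
    waveform = (mono / 32768.0).astype(np.float32)
    # Resampling can overshoot slightly, so clip to [-1, 1] (assumption of this sketch).
    return np.clip(waveform, -1.0, 1.0)


# Usage: one second of a synthetic 440 Hz stereo tone at 44.1 kHz.
t = np.arange(44_100) / 44_100
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
stereo = np.stack([tone, tone], axis=1)
waveform = pcm16_to_gemma_waveform(stereo, source_rate=44_100)
print(waveform.shape, waveform.dtype)
```

The resampling runs before the downmix to follow the order of the steps listed above. Averaging first would halve the amount of data to resample and should give an equivalent result, because both operations are linear.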

Speech-to-text

Gemma 4 E2B, E4B, and 12B Unified are trained to recognize speech in multiple languages, which lets you transcribe audio input in various languages into text.

Use the following prompt structure for automatic speech recognition (ASR):

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
*   Only output the transcription, with no newlines.
*   When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
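
To avoid retyping the prompt for each language, you can fill in the {LANGUAGE} placeholder programmatically. The helper below is a hypothetical convenience function, not part of the Transformers API. It only reproduces the prompt structure shown above.

```python
# Prompt structure from this guide, with {language} as the placeholder.
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)


def build_asr_prompt(language: str) -> str:
    """Return the ASR prompt for the given spoken language (hypothetical helper)."""
    return ASR_TEMPLATE.format(language=language)


print(build_asr_prompt("English"))
```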

The following code example shows how to prompt the model to transcribe an audio file into text using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            #{"type": "text", "text": "Transcribe the following speech segment in English into English text. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

I woke up early today feeling really fresh the morning light was beautiful and I enjoyed a nice cup of coffee<turn|>

You can also include several audio clips in a single prompt and label each clip with a short text segment, as in the following example:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 1024
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Give me a concise overview of these audio files."},
            {"type": "text", "text": "journal1:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
            {"type": "text", "text": "journal2:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal2.wav"},
            {"type": "text", "text": "journal3:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal3.wav"},
            {"type": "text", "text": "journal4:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal4.wav"},
            {"type": "text", "text": "journal5:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal5.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Here is a concise overview of each audio file:

**journal1:** The speaker describes a fresh and peaceful day, enjoying a cup of coffee.
**journal2:** The speaker had a perfect day at the park, including a walk and watching cherry blossoms.
**journal3:** The speaker finished the day with a good book, feeling grateful for simple moments.
**journal4:** The speaker returned from work and noted the beautiful night sky and a clear view from the train.
**journal5:** The speaker had a great lunch with an old friend, which was a pleasant way to catch up and made their day.
<turn|>
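
When a prompt contains many clips, the alternating label and audio entries can be generated instead of written by hand. The sketch below builds the same content list as the example above. The function name is a hypothetical helper, and the message layout simply mirrors the one used in this guide.

```python
def build_multi_audio_content(instruction: str, clips: dict[str, str]) -> list[dict]:
    """Build a content list: one instruction, then a text label before each clip.

    clips maps a label to an audio file path or URL.
    """
    content = [{"type": "text", "text": instruction}]
    for label, source in clips.items():
        content.append({"type": "text", "text": f"{label}:"})
        content.append({"type": "audio", "audio": source})
    return content


prefix = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"
clips = {f"journal{i}": f"{prefix}journal{i}.wav" for i in range(1, 6)}
messages = [{
    "role": "user",
    "content": build_multi_audio_content(
        "Give me a concise overview of these audio files.", clips),
}]
print(len(messages[0]["content"]))  # 11 entries: 1 instruction + 5 labels + 5 clips
```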

Automatic speech translation

Gemma 4 E2B, E4B, and 12B Unified are trained to translate speech in multiple languages, which lets you translate spoken audio directly into another language.

Use the following prompt structure for automatic speech translation (AST):

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
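
The {SOURCE_LANGUAGE} and {TARGET_LANGUAGE} placeholders can be filled in with a small helper. The function below is hypothetical and only reproduces the prompt structure shown above.

```python
# Prompt structure from this guide, with {source} and {target} as placeholders.
AST_TEMPLATE = (
    "Transcribe the following speech segment in {source}, "
    "then translate it into {target}.\n"
    "When formatting the answer, first output the transcription in {source}, "
    "then one newline, then output the string '{target}: ', "
    "then the translation in {target}."
)


def build_ast_prompt(source_language: str, target_language: str) -> str:
    """Return the AST prompt for a language pair (hypothetical helper)."""
    return AST_TEMPLATE.format(source=source_language, target=target_language)


print(build_ast_prompt("English", "Korean"))
```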

The following code example shows how to prompt the model to translate speech into text using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Roses are red, violets are blue.
Korean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>
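
Because the prompt asks for a fixed layout, the response can be split back into its two parts. The following sketch assumes the model followed the requested format and that the output may end with the <turn|> marker seen in the sample outputs. The function name is hypothetical, and real responses may need more defensive handling.

```python
END_OF_TURN = "<turn|>"  # marker seen at the end of the sample outputs in this guide


def parse_ast_output(text: str, target_language: str) -> tuple[str, str]:
    """Split model output into (transcription, translation)."""
    cleaned = text.strip()
    # Drop the trailing end-of-turn marker if it is present.
    if cleaned.endswith(END_OF_TURN):
        cleaned = cleaned[: -len(END_OF_TURN)].rstrip()
    # The prompt requests: transcription, one newline, '<target>: ', translation.
    marker = f"\n{target_language}: "
    transcription, found, translation = cleaned.partition(marker)
    if not found:
        raise ValueError("output does not follow the requested format")
    return transcription.strip(), translation.strip()


sample = "Roses are red, violets are blue.\nKorean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>"
transcription, translation = parse_ast_output(sample, "Korean")
print(transcription)
print(translation)
```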

Automatic speech translation / automatic speech recognition

Try it yourself

pip install ipywebrtc

Press the circle button and start speaking. Click the circle button again when you have finished. The widget then immediately starts playing back what it recorded.

from google.colab import output
output.enable_custom_widget_manager()

from ipywebrtc import AudioRecorder, CameraStream

camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

Convert the webm file to the wav format that PyTorch understands.

with open('/content/recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i /content/recording.webm /content/recording.wav -y

Input #0, matroska,webm, from '/content/recording.webm':
  Metadata:
    encoder         : Chrome
  Duration: 00:00:03.00, start: 0.000000, bitrate: 132 kb/s
  Stream #0:0(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Stream mapping:
  Stream #0:0 -> #0:0 (opus (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/content/recording.wav':
  Metadata:
    ISFT            : Lavf58.76.100
  Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
    Metadata:
      encoder         : Lavc58.134.100 pcm_s16le
size=     287kB time=00:00:02.99 bitrate= 783.7kbits/s speed=79.4x    
video:0kB audio:287kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.026552%
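
The log above shows that the converted file keeps the recording's 48000 Hz sample rate, while the audio data section lists 16 kHz mono as the expected encoding. Whether the pipeline resamples internally is not stated in this guide, so one cautious option is to ask ffmpeg for 16 kHz mono output directly with its -ar (sample rate) and -ac (channel count) options. The helper names below are hypothetical. The sketch builds the command and reads back the WAV header with Python's standard wave module.

```python
import subprocess
import wave


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16_000) -> list[str]:
    """Build an ffmpeg command that writes mono audio at the given sample rate."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion (requires ffmpeg to be installed and src to exist)."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


def describe_wav(path: str) -> dict:
    """Read channel count, sample rate, sample width, and duration from a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "bytes_per_sample": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }


command = build_ffmpeg_command("/content/recording.webm", "/content/recording.wav")
print(" ".join(command))
# In the notebook, after recording:
# convert_to_wav("/content/recording.webm", "/content/recording.wav")
# print(describe_wav("/content/recording.wav"))
```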

ASR

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?<turn|>

AST

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?
Korean: 역에 어떻게 가나요?<turn|>

Summary and next steps

In this guide, you learned how to process audio using Gemma 4 models. The examples showed how to use speech-to-text (ASR) to transcribe spoken language and automatic speech translation (AST) to translate speech directly into another language. You also saw how to record audio from a microphone in a notebook environment for processing.

See the following documentation for further reading.