Audio understanding

Starting with Gemma 3n, you can use audio directly in your prompts and workflows. Audio and spoken language are rich sources of data for capturing user intent, recording information about the world around us, and understanding the specific problems to be solved.

This guide provides an overview of the audio processing capabilities of Gemma 4, including automatic speech recognition (ASR), speech translation, and general speech understanding.

This notebook runs on a T4 GPU.

Install Python packages

Install the Hugging Face libraries required to run the Gemma model and make requests.

# Install PyTorch & other libraries
pip install torch accelerate

# Install the transformers library
pip install "transformers>=5.10.1"

Load the model

Use the transformers library to load the model and its processor. The following code example does this through the pipeline API with the any-to-any task:

MODEL_ID = "google/gemma-4-E2B-it" # @param ["google/gemma-4-E2B-it","google/gemma-4-E4B-it", "google/gemma-4-12B-it"]

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto"
)

config.json:   0%|          | 0.00/4.95k [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/10.2G [00:00<?, ?B/s]
Loading weights:   0%|          | 0/1951 [00:00<?, ?it/s]
generation_config.json:   0%|          | 0.00/208 [00:00<?, ?B/s]
processor_config.json:   0%|          | 0.00/1.69k [00:00<?, ?B/s]
chat_template.jinja:   0%|          | 0.00/17.3k [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/2.10k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/32.2M [00:00<?, ?B/s]

Audio data

Digital audio data comes in many formats and resolutions. The audio formats you can actually use with Gemma, such as MP3 and WAV, are determined by the framework you choose for converting audio data into tensors. Here are some specific considerations when preparing audio data for processing with Gemma:

Token cost: Each second of audio costs 25 tokens for Gemma 4 (6.25 tokens for Gemma 3n).
Clip length: Audio clips can be at most 30 seconds long.
Audio channels: Audio data is processed as a single channel. If you are using multi-channel audio, such as left and right channels, consider reducing the data to a single channel by dropping channels or by combining the audio data into one channel.
Technical encoding:
- Sample rate: 16 kHz
- Bit depth: 32-bit floating point, with samples normalized to the range [-1, 1].
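The token cost and clip length listed above can be turned into a quick budget estimate. The following is a minimal sketch, not part of the Gemma tooling: the helper names `audio_token_cost` and `plan_chunks` are hypothetical, rounding fractional token counts up is an assumption, and splitting long recordings into consecutive windows is one possible way to stay within the 30-second limit.

```python
import math

GEMMA4_TOKENS_PER_SECOND = 25   # figure from the list above
MAX_CLIP_SECONDS = 30.0         # figure from the list above


def audio_token_cost(duration_s: float, tokens_per_second: float = GEMMA4_TOKENS_PER_SECOND) -> int:
    """Estimate the token cost of a clip, rounding up (assumed rounding rule)."""
    return math.ceil(duration_s * tokens_per_second)


def plan_chunks(duration_s: float, max_chunk_s: float = MAX_CLIP_SECONDS) -> list[tuple[float, float]]:
    """Split [0, duration_s] into consecutive windows no longer than max_chunk_s."""
    duration_s = float(duration_s)
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


print(audio_token_cost(30))   # cost of one full-length clip
print(plan_chunks(75))        # (start, end) windows in seconds for a 75 s recording
```

Cutting at fixed offsets can split a word in half, so for speech you may prefer to move each boundary to a nearby pause.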

If the audio data you plan to process differs significantly from what the input processing expects, particularly in channels, sample rate, or bit depth, consider resampling or trimming the audio data to match the data resolution the model handles.
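Before deciding whether to resample or trim, it can help to inspect what a file actually contains. The sketch below uses only Python's standard `wave` module, which reads PCM WAV files; the helper names `describe_wav` and `needed_fixes` are hypothetical, and the in-memory file is a synthetic stand-in for a real recording.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Return the WAV properties that matter for Gemma audio input."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


def needed_fixes(info: dict, target_rate: int = 16_000, max_seconds: float = 30.0) -> list[str]:
    """List the adjustments suggested by the considerations above."""
    fixes = []
    if info["channels"] != 1:
        fixes.append("downmix to mono")
    if info["sample_rate"] != target_rate:
        fixes.append(f"resample to {target_rate} Hz")
    if info["duration_s"] > max_seconds:
        fixes.append(f"trim or split to {max_seconds:g} s")
    return fixes


# Build 1 second of silent stereo 44.1 kHz 16-bit audio in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)
    wav.setframerate(44_100)
    wav.writeframes(b"\x00" * (44_100 * 2 * 2))
buffer.seek(0)

info = describe_wav(buffer)
print(info)
print(needed_fixes(info))
```

`describe_wav` also accepts a file path in place of the in-memory buffer.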

Audio encoding

Although high-level libraries (such as the Hugging Face AutoProcessor) usually preprocess audio automatically, you may sometimes need to implement custom encoding.

When encoding audio data with your own code for use with Gemma, follow the recommended conversion process. If you are working with audio files encoded in a specific format, such as MP3 or WAV, first decode them into samples using a library such as ffmpeg. After decoding, convert the audio to a single-channel, 16 kHz float32 waveform in the range [-1, 1]. For example, if you are working with stereo, signed 16-bit PCM integer WAV files at 44.1 kHz, follow these steps:

1. Resample the audio data to 16 kHz
2. Downmix from stereo to mono by averaging the 2 channels
3. Convert from int16 to float32 and divide by 32768.0 to scale to the range [-1, 1]
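The three steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the preprocessing that the Hugging Face processor performs: the function names are hypothetical, the input is assumed to be already decoded into an int16 array shaped (frames, channels), and `fourier_resample` is a small NumPy-only stand-in for a Fourier-method resampler. A library routine such as scipy.signal.resample is the more robust choice in practice.

```python
import numpy as np


def fourier_resample(samples: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample along axis 0 by truncating or zero-padding the spectrum."""
    n = samples.shape[0]
    m = int(round(n * target_sr / orig_sr))
    spectrum = np.fft.rfft(samples, axis=0)
    resized = np.zeros((m // 2 + 1,) + spectrum.shape[1:], dtype=spectrum.dtype)
    keep = min(resized.shape[0], spectrum.shape[0])
    resized[:keep] = spectrum[:keep]
    # irfft divides by m while rfft summed over n samples, so rescale by m / n.
    return np.fft.irfft(resized, n=m, axis=0) * (m / n)


def to_model_waveform(pcm: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Convert int16 PCM shaped (frames, channels) to mono float32 in [-1, 1]."""
    resampled = fourier_resample(pcm.astype(np.float64), orig_sr, target_sr)  # step 1
    mono = resampled.mean(axis=1) if resampled.ndim == 2 else resampled       # step 2
    scaled = mono / 32768.0                                                    # step 3
    return np.clip(scaled, -1.0, 1.0).astype(np.float32)


# Synthetic stand-in for a decoded file: 1 s of a 440 Hz tone, stereo, 44.1 kHz, int16.
t = np.arange(44_100) / 44_100
tone = np.round(16_384 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
stereo = np.stack([tone, tone], axis=1)

waveform = to_model_waveform(stereo, orig_sr=44_100)
print(waveform.shape, waveform.dtype)
```

All three steps are linear, so their order has little effect on the result. The final clip only guards against small overshoot introduced by resampling.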

Note: When resampling audio to 16 kHz, use a Fourier method for best results, such as scipy.signal.resample or librosa.resample(res_type='scipy').

Speech to text

Gemma 4 E2B, E4B, and 12B Unified are trained for multilingual speech recognition, letting you transcribe audio input in many languages into text.

Use the following prompt structure for automatic speech recognition (ASR).

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
*   Only output the transcription, with no newlines.
*   When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
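The template above can be filled in programmatically. The following sketch assumes the chat message layout used by the code examples in this guide; `ASR_TEMPLATE` and `build_asr_message` are hypothetical names introduced for illustration, and the audio path is a placeholder.

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, "
    "and write 3 instead of three."
)


def build_asr_message(language: str, audio: str) -> list[dict]:
    """Return a single-turn chat message asking for a transcription."""
    prompt = ASR_TEMPLATE.format(language=language)
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "audio", "audio": audio},  # local path or URL
        ],
    }]


messages = build_asr_message("English", "journal1.wav")
print(messages[0]["content"][0]["text"])
```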

The following code examples show how to prompt the model to transcribe text from audio files using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            #{"type": "text", "text": "Transcribe the following speech segment in English into English text. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

I woke up early today feeling really fresh the morning light was beautiful and I enjoyed a nice cup of coffee<turn|>

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 1024
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Give me a concise overview of these audio files."},
            {"type": "text", "text": "journal1:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
            {"type": "text", "text": "journal2:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal2.wav"},
            {"type": "text", "text": "journal3:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal3.wav"},
            {"type": "text", "text": "journal4:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal4.wav"},
            {"type": "text", "text": "journal5:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal5.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Here is a concise overview of each audio file:

**journal1:** The speaker describes a fresh and peaceful day, enjoying a cup of coffee.
**journal2:** The speaker had a perfect day at the park, including a walk and watching cherry blossoms.
**journal3:** The speaker finished the day with a good book, feeling grateful for simple moments.
**journal4:** The speaker returned from work and noted the beautiful night sky and a clear view from the train.
**journal5:** The speaker had a great lunch with an old friend, which was a pleasant way to catch up and made their day.
<turn|>
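The content list in the multi-file example follows a regular pattern: one instruction, then a text label and an audio entry per file. A small helper can generate it for any number of files. This is a sketch; `build_overview_messages` is a hypothetical name, and it assumes each file is named after its label with a `.wav` extension, as in the sample data.

```python
def build_overview_messages(prefix: str, names: list[str], instruction: str) -> list[dict]:
    """Interleave a text label and an audio entry for each named clip."""
    content = [{"type": "text", "text": instruction}]
    for name in names:
        content.append({"type": "text", "text": f"{name}:"})
        content.append({"type": "audio", "audio": f"{prefix}{name}.wav"})
    return [{"role": "user", "content": content}]


RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"
names = [f"journal{i}" for i in range(1, 6)]
messages = build_overview_messages(
    RESOURCE_URL_PREFIX, names, "Give me a concise overview of these audio files."
)
print(len(messages[0]["content"]))  # one instruction plus two entries per file
```

Every clip adds to the prompt at the per-second token cost, so the number of files you can include is bounded by the context window.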

Automatic speech translation

Gemma 4 E2B, E4B, and 12B Unified are trained for multilingual speech translation tasks, letting you translate spoken audio directly into another language.

Use the following prompt structure for automatic speech translation (AST).

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.

The following code examples show how to prompt the model to translate spoken audio into text using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Roses are red, violets are blue.
Korean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>
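Because the prompt fixes the output layout (transcription, a newline, then the target language label and the translation), the response can be split back into its two parts. The sketch below assumes the model follows that layout and that the output may end with the `<turn|>` marker seen in the sample outputs; `parse_ast_output` is a hypothetical helper, not part of Transformers.

```python
END_OF_TURN = "<turn|>"  # marker seen at the end of the sample outputs


def parse_ast_output(text: str, target_language: str) -> dict:
    """Split 'transcription, newline, Language: translation' into two fields."""
    cleaned = text.strip()
    if cleaned.endswith(END_OF_TURN):
        cleaned = cleaned[: -len(END_OF_TURN)].rstrip()
    label = f"{target_language}: "
    transcription, sep, translation = cleaned.partition(f"\n{label}")
    if not sep:
        raise ValueError(f"expected a line starting with {label!r}")
    return {"transcription": transcription.strip(), "translation": translation.strip()}


sample = "Roses are red, violets are blue.\nKorean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>"
result = parse_ast_output(sample, "Korean")
print(result["transcription"])
print(result["translation"])
```

Raising an error when the label is missing makes it easy to spot responses that were cut off by a low `max_new_tokens` value or that ignored the formatting instruction.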

Automatic speech translation / automatic speech recognition

Try it yourself

pip install ipywebrtc

Click the round button and start speaking. Click the round button again when you finish. The widget immediately starts playing back what it captured.

from google.colab import output
output.enable_custom_widget_manager()

from ipywebrtc import AudioRecorder, CameraStream

camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

Convert the webm file to a wav format that PyTorch can read.

with open('/content/recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i /content/recording.webm /content/recording.wav -y

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, matroska,webm, from '/content/recording.webm':
  Metadata:
    encoder         : Chrome
  Duration: 00:00:03.00, start: 0.000000, bitrate: 132 kb/s
  Stream #0:0(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Stream mapping:
  Stream #0:0 -> #0:0 (opus (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/content/recording.wav':
  Metadata:
    ISFT            : Lavf58.76.100
  Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
    Metadata:
      encoder         : Lavc58.134.100 pcm_s16le
size=     287kB time=00:00:02.99 bitrate= 783.7kbits/s speed=79.4x    
video:0kB audio:287kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.026552%
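The log above shows that the converted file is 48000 Hz, mono, 16-bit PCM, whereas the audio data section lists 16 kHz as the model's sample rate. The examples in this guide pass the file to the pipeline as it is. If you would rather have the file itself match the listed rate and channel count, ffmpeg's `-ar` (sample rate) and `-ac` (channel count) options can do the conversion in the same step. The helper below is a hypothetical sketch that only builds the command. Running it requires ffmpeg to be installed.

```python
import shlex


def ffmpeg_to_model_wav(src: str, dst: str, sample_rate: int = 16_000, channels: int = 1) -> list[str]:
    """Build an ffmpeg command that decodes src to a WAV at the given rate and channel count."""
    return [
        "ffmpeg", "-y",             # overwrite the output file if it exists
        "-i", src,
        "-ar", str(sample_rate),    # output sample rate in Hz
        "-ac", str(channels),       # output channel count
        dst,
    ]


cmd = ffmpeg_to_model_wav("/content/recording.webm", "/content/recording.wav")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```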

ASR

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?<turn|>

AST

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?
Korean: 역에 어떻게 가나요?<turn|>

Summary and next steps

In this guide, you learned how to process audio with Gemma 4 models. The examples showed how to perform speech-to-text (ASR) to transcribe speech and automatic speech translation (AST) to translate spoken audio directly into another language. You also saw how to capture audio from a microphone in a notebook environment for processing.

See the following documentation for further reading.