Audio understanding

Starting with Gemma 3n, you can use audio directly in your prompts and workflows. Audio and spoken language are rich sources of data for capturing user intent, recording information about the world around us, and understanding specific problems to be solved.

This guide provides an overview of the audio processing capabilities of Gemma 4, including automatic speech recognition (ASR), translation, and general speech understanding.

This notebook runs on a T4 GPU.

Install Python packages

Install the Hugging Face libraries required to run the Gemma model and make requests.

# Install PyTorch & other libraries
pip install torch accelerate

# Install the transformers library
pip install "transformers>=5.10.1"

Load model

Use the transformers library to load the model. The following code example creates a pipeline for the any-to-any task, which loads both the processor and the model for MODEL_ID:

MODEL_ID = "google/gemma-4-E2B-it" # @param ["google/gemma-4-E2B-it","google/gemma-4-E4B-it", "google/gemma-4-12B-it"]

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto"
)

Audio data

Digital audio data comes in many formats and resolution levels. The audio formats you can actually use with Gemma, such as MP3 and WAV, are determined by the framework you choose for converting audio data into tensors. Here are some specific considerations for preparing audio data for processing with Gemma:

Token cost: each second of audio corresponds to 25 tokens for Gemma 4 (6.25 tokens for Gemma 3n).
Clip length: audio clips can be at most 30 seconds long.
Audio channels: audio data is processed as a single audio channel. If you are using multi-channel audio, such as left and right channels, consider reducing the data to a single channel by removing channels or by combining the audio data into a single channel.
Technical encoding:
- Sample rate: 16 kHz
- Bit depth: 32-bit float format, with samples normalized to the range [-1, 1].
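
The token cost and clip length figures above translate into simple arithmetic. The following is a minimal sketch using only the numbers stated in this list. The function names are hypothetical, and rounding fractional token counts up is an assumption made here, not documented model behavior.

```python
import math

MAX_CLIP_SECONDS = 30  # maximum clip length stated in this guide


def audio_token_cost(seconds: float, tokens_per_second: float = 25.0) -> int:
    """Estimate the tokens used by an audio clip (rounded up; rounding is an assumption)."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return math.ceil(seconds * tokens_per_second)


def clips_needed(seconds: float, max_clip_seconds: float = MAX_CLIP_SECONDS) -> int:
    """Number of clips of at most max_clip_seconds needed to cover the audio."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return math.ceil(seconds / max_clip_seconds)


gemma4_tokens = audio_token_cost(30)          # 30 s at 25 tokens per second
gemma3n_tokens = audio_token_cost(30, 6.25)   # 30 s at 6.25 tokens per second
clips = clips_needed(75)                      # a 75 s recording

print(gemma4_tokens, gemma3n_tokens, clips)   # 750 188 3
```

A full 30-second clip therefore costs about 750 tokens with Gemma 4, which is useful to keep in mind when a prompt carries several clips.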

If the audio data you plan to process differs significantly from the expected input format, particularly in terms of channels, sample rate, and bit depth, consider resampling or trimming your audio data to match the data resolution handled by the model.
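
One way to check whether a file differs from the expected input is to read its header before sending it to the model. The sketch below uses Python's standard wave module, which reads PCM WAV files only. The function names and the wording of the suggested steps are hypothetical, and the demo writes its own silent test file rather than using real data.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # sample rate expected by the model, per this guide
MAX_CLIP_SECONDS = 30        # maximum clip length, per this guide


def inspect_wav(path: str) -> dict:
    """Read channel count, sample rate, bit depth, and duration from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_seconds": wav.getnframes() / rate,
        }


def conversion_steps(info: dict) -> list[str]:
    """List the conversions needed to reach 16 kHz mono float32 audio of at most 30 s."""
    steps = []
    if info["sample_rate"] != TARGET_SAMPLE_RATE:
        steps.append("resample to 16 kHz")
    if info["channels"] != 1:
        steps.append("downmix to mono")
    if info["duration_seconds"] > MAX_CLIP_SECONDS:
        steps.append("trim or split into clips of at most 30 s")
    # The wave module only reads integer PCM, so a float conversion is always needed.
    steps.append("convert samples to float32 in [-1, 1]")
    return steps


# Demo: write one second of silent 44.1 kHz stereo 16-bit PCM, then inspect it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.wav")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(44_100)
        wav.writeframes(b"\x00" * (44_100 * 2 * 2))
    info = inspect_wav(path)
    steps = conversion_steps(info)

print(info)
print(steps)
```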

Audio encoding

Although high-level libraries (such as the Hugging Face AutoProcessor) often handle audio preprocessing automatically, you may sometimes need to implement custom encoding.

When you encode audio data with your own code for use with Gemma, follow the recommended conversion process. If you are working with audio files encoded in a specific format, such as MP3 or WAV, you must first decode them into samples using a library such as ffmpeg. Once the data is decoded, convert the audio into mono-channel, 16 kHz float32 waveforms in the range [-1, 1]. For example, if you are working with stereo, signed 16-bit PCM WAV files at 44.1 kHz, follow these steps:

Resample the audio data to 16 kHz
Downmix from stereo to mono by averaging the two channels
Convert from int16 to float32 and divide by 32768.0 to scale to the range [-1, 1]

Note: When resampling audio to 16 kHz, use a Fourier method such as scipy.signal.resample or librosa.resample(res_type='scipy') for best results.
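
The three conversion steps can be sketched with NumPy and scipy.signal.resample as follows. This is an illustrative outline under stated assumptions, not the preprocessing that the Hugging Face processor performs internally: the function name is hypothetical, the input is assumed to be already decoded int16 PCM with shape (frames,) or (frames, channels), and the demo signal is synthetic.

```python
import numpy as np
from scipy.signal import resample

TARGET_SAMPLE_RATE = 16_000


def pcm16_to_model_input(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Convert decoded int16 PCM to a mono float32 waveform at 16 kHz in [-1, 1]."""
    audio = samples.astype(np.float64)

    # Step 1: resample to 16 kHz with the Fourier method, along the time axis.
    if sample_rate != TARGET_SAMPLE_RATE:
        num_out = round(audio.shape[0] * TARGET_SAMPLE_RATE / sample_rate)
        audio = resample(audio, num_out, axis=0)

    # Step 2: downmix to mono by averaging the channels.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)

    # Step 3: scale the int16 range to [-1, 1] and store as float32.
    return (audio / 32768.0).astype(np.float32)


# Demo: one second of a 440 Hz tone as 44.1 kHz stereo int16 (synthetic test data).
t = np.arange(44_100) / 44_100
tone = (16_000 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
stereo = np.stack([tone, tone], axis=1)

waveform = pcm16_to_model_input(stereo, 44_100)
print(waveform.shape, waveform.dtype)  # (16000,) float32
```

For real files, scipy.io.wavfile.read returns the sample rate together with an int16 array for 16-bit PCM WAV input, which can be passed to a function like this one.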

Speech-to-Text

The Gemma 4 E2B, E4B, and 12B Unified models are trained for multilingual speech recognition, which lets you transcribe audio input in various languages into text.

Use the following prompt structure for automatic speech recognition (ASR).

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
*   Only output the transcription, with no newlines.
*   When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
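
If you transcribe audio in several languages, filling the {LANGUAGE} placeholder by hand becomes repetitive. The following sketch builds the prompt from the template above and wraps it in the chat message layout used by the examples in this guide. The names ASR_TEMPLATE and build_asr_messages are hypothetical helpers, and the audio path is a placeholder.

```python
# Prompt template copied from the structure above; {LANGUAGE} is filled in at call time.
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)


def build_asr_messages(language: str, audio: str) -> list[dict]:
    """Return a single-turn chat message list with the ASR prompt and one audio item."""
    prompt = ASR_TEMPLATE.format(LANGUAGE=language)
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio},
            ],
        }
    ]


# "journal1.wav" is a placeholder; pass a local path or URL to your own audio.
messages = build_asr_messages("English", "journal1.wav")
print(messages[0]["content"][0]["text"])
```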

The following code examples show how to prompt the model to transcribe text from audio files using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            #{"type": "text", "text": "Transcribe the following speech segment in English into English text. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

I woke up early today feeling really fresh the morning light was beautiful and I enjoyed a nice cup of coffee<turn|>

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 1024
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Give me a concise overview of these audio files."},
            {"type": "text", "text": "journal1:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
            {"type": "text", "text": "journal2:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal2.wav"},
            {"type": "text", "text": "journal3:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal3.wav"},
            {"type": "text", "text": "journal4:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal4.wav"},
            {"type": "text", "text": "journal5:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal5.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Here is a concise overview of each audio file:

**journal1:** The speaker describes a fresh and peaceful day, enjoying a cup of coffee.
**journal2:** The speaker had a perfect day at the park, including a walk and watching cherry blossoms.
**journal3:** The speaker finished the day with a good book, feeling grateful for simple moments.
**journal4:** The speaker returned from work and noted the beautiful night sky and a clear view from the train.
**journal5:** The speaker had a great lunch with an old friend, which was a pleasant way to catch up and made their day.
<turn|>
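
The multi-file prompt above repeats the same label and audio pair for each journal entry. As a sketch, that content list can be generated with a loop. The helper name build_overview_messages is hypothetical, and it assumes every file is a .wav under the same URL prefix, as in the example.

```python
RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"


def build_overview_messages(names, instruction, url_prefix=RESOURCE_URL_PREFIX):
    """Build one user turn: an instruction followed by a 'name:' label and audio item per file."""
    content = [{"type": "text", "text": instruction}]
    for name in names:
        content.append({"type": "text", "text": f"{name}:"})
        content.append({"type": "audio", "audio": f"{url_prefix}{name}.wav"})
    return [{"role": "user", "content": content}]


names = [f"journal{i}" for i in range(1, 6)]  # journal1 ... journal5
messages = build_overview_messages(
    names, "Give me a concise overview of these audio files."
)
print(len(messages[0]["content"]))  # 11
```

The resulting messages list has the same shape as the hand-written one and can be passed to the pipeline in the same way.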

Automatic speech translation

The Gemma 4 E2B, E4B, and 12B Unified models are trained for multilingual speech translation tasks, which lets you translate spoken audio directly into another language.

Use the following prompt structure for automatic speech translation (AST).

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
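
The two placeholders in this template can be filled programmatically. The following is a small sketch in which AST_TEMPLATE and build_ast_prompt are hypothetical names, and the template text is copied from the structure above.

```python
# Template copied from the prompt structure above.
AST_TEMPLATE = (
    "Transcribe the following speech segment in {SOURCE_LANGUAGE}, "
    "then translate it into {TARGET_LANGUAGE}.\n"
    "When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, "
    "then one newline, then output the string '{TARGET_LANGUAGE}: ', "
    "then the translation in {TARGET_LANGUAGE}."
)


def build_ast_prompt(source_language: str, target_language: str) -> str:
    """Fill the AST template with a source and a target language."""
    return AST_TEMPLATE.format(
        SOURCE_LANGUAGE=source_language, TARGET_LANGUAGE=target_language
    )


prompt = build_ast_prompt("English", "Korean")
print(prompt)
```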

The following code examples show how to prompt the model to translate spoken audio into text using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Roses are red, violets are blue.
Korean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>
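
Because the prompt fixes the answer layout, the response can be split into its transcription and translation parts. The sketch below assumes the output follows that layout exactly and that the trailing <turn|> marker seen in the sample outputs can simply be stripped. The function name parse_ast_output is hypothetical.

```python
TURN_MARKER = "<turn|>"  # marker that ends the sample outputs in this guide


def parse_ast_output(text: str, target_language: str) -> tuple[str, str]:
    """Split '<transcription>\\n<TARGET>: <translation>' into its two parts."""
    cleaned = text.strip()
    if cleaned.endswith(TURN_MARKER):
        cleaned = cleaned[: -len(TURN_MARKER)].rstrip()

    separator = f"\n{target_language}: "
    transcription, found, translation = cleaned.partition(separator)
    if not found:
        raise ValueError(f"expected a line starting with '{target_language}: '")
    return transcription.strip(), translation.strip()


raw = "Roses are red, violets are blue.\nKorean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>"
transcription, translation = parse_ast_output(raw, "Korean")
print(transcription)
print(translation)
```

Model output is not guaranteed to follow the requested layout, so the ValueError branch is worth handling in real use.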

Automatic speech translation / automatic speech recognition

Try it yourself

pip install ipywebrtc

Press the circle button and start speaking. When you are done, click the circle button again. The widget immediately starts playing back what it recorded.

from google.colab import output
output.enable_custom_widget_manager()

from ipywebrtc import AudioRecorder, CameraStream

camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

Convert the webm file to the WAV format, which PyTorch can read.

with open('/content/recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i /content/recording.webm /content/recording.wav -y

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, matroska,webm, from '/content/recording.webm':
  Metadata:
    encoder         : Chrome
  Duration: 00:00:03.00, start: 0.000000, bitrate: 132 kb/s
  Stream #0:0(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Stream mapping:
  Stream #0:0 -> #0:0 (opus (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/content/recording.wav':
  Metadata:
    ISFT            : Lavf58.76.100
  Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
    Metadata:
      encoder         : Lavc58.134.100 pcm_s16le
size=     287kB time=00:00:02.99 bitrate= 783.7kbits/s speed=79.4x    
video:0kB audio:287kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.026552%
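
The log above shows that the converted file is 48000 Hz mono 16-bit PCM. If you want the file itself to match the 16 kHz mono format described earlier, ffmpeg's -ar (audio sample rate) and -ac (audio channels) options can request it during conversion. The following is a sketch with a hypothetical helper name and output path. It relies on ffmpeg's own resampler rather than the Fourier-based methods mentioned in the resampling note, so treat it as a convenience rather than the recommended path.

```python
import os
import shutil
import subprocess


def ffmpeg_to_wav_16k_mono(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts src to a 16 kHz mono WAV file."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]


# The output path is a hypothetical name chosen for this example.
cmd = ffmpeg_to_wav_16k_mono("/content/recording.webm", "/content/recording_16k.wav")
print(" ".join(cmd))

# Run the conversion only if ffmpeg is installed and the recording exists.
if shutil.which("ffmpeg") and os.path.exists("/content/recording.webm"):
    subprocess.run(cmd, check=True)
```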

ASR

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?<turn|>

AST

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?
Korean: 역에 어떻게 가나요?<turn|>

Summary and next steps

In this guide, you learned how to process audio using Gemma 4 models. The examples showed how to perform automatic speech recognition (ASR) to transcribe spoken language, as well as automatic speech translation (AST) to translate spoken audio directly into another language. You also saw how to capture audio from a microphone in a notebook environment for processing.

For more information, see the following documentation.