Audio understanding

Starting with Gemma 3n, you can use audio directly in prompts and workflows. Audio and spoken language are rich sources of data for capturing user intent, recording information about the world around us, and understanding specific problems to be solved.

This guide gives an overview of the audio processing capabilities of Gemma 4, including automatic speech recognition (ASR), translation, and general speech understanding.

This notebook runs on a T4 GPU.

Install Python packages

Install the Hugging Face libraries required to run the Gemma model and make requests.

# Install PyTorch & other libraries
pip install torch accelerate

# Install the transformers library
pip install "transformers>=5.10.1"

Load model

Use the transformers library to load the model through the pipeline API with the "any-to-any" task, as shown in the following code example:

MODEL_ID = "google/gemma-4-E2B-it" # @param ["google/gemma-4-E2B-it","google/gemma-4-E4B-it", "google/gemma-4-12B-it"]

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto"
)

Audio data

Digital audio data can come in many formats and resolution levels. The audio formats you can use with Gemma, such as MP3 and WAV, are determined by the framework you choose to convert sound data into tensors. Here are some specific considerations for preparing audio data for processing with Gemma:

Token cost: each second of audio costs 25 tokens for Gemma 4 (6.25 tokens for Gemma 3n).
Clip length: audio can be at most 30 seconds long.
Audio channels: audio data is processed as a single channel. If you are using multichannel audio, such as left and right channels, consider reducing the data to a single channel by removing channels or combining the sound data into one channel.
Technical encoding:
- Sample rate: 16 kHz
- Bit depth: 32-bit floating-point format, with samples normalized to the range [-1, 1].

If the audio data you plan to process differs significantly from the input the model processes, particularly in channels, sample rate, and bit depth, consider resampling or trimming your audio data to match the data resolution the model processes.
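
The constraints above can be checked before a clip is sent to the model. The following is a minimal sketch that uses only Python's standard-library wave module, so it reads PCM WAV files only. The function name inspect_wav and the constants are illustrative and not part of any Gemma or Transformers API; the token estimate simply multiplies the clip duration by the 25 tokens per second stated above for Gemma 4.

```python
import wave

# Values taken from the considerations above; the names are illustrative only.
TOKENS_PER_SECOND = 25      # Gemma 4 audio token cost
MAX_CLIP_SECONDS = 30       # maximum clip length
TARGET_SAMPLE_RATE = 16000  # expected sample rate in Hz


def inspect_wav(path):
    """Report basic properties of a PCM WAV file and flag mismatches."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_seconds": duration,
        # Rough estimate: duration times the per-second token cost.
        "estimated_tokens": round(duration * TOKENS_PER_SECOND),
        "needs_downmix": channels != 1,
        "needs_resample": sample_rate != TARGET_SAMPLE_RATE,
        "too_long": duration > MAX_CLIP_SECONDS,
    }
```

For example, a 2-second stereo file at 44.1 kHz would be reported as roughly 50 tokens, with both needs_downmix and needs_resample set.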

Audio encoding

Although high-level libraries (such as the Hugging Face AutoProcessor) generally handle audio preprocessing automatically, you sometimes need to implement custom encoding.

When encoding audio data with your own code for use with Gemma, follow the recommended conversion process. If you are working with audio files encoded in a specific format, such as MP3 or WAV, first decode them into samples using a tool such as ffmpeg. Once the data is decoded, convert the audio into single-channel, 16 kHz float32 waveforms in the range [-1, 1]. For example, if you are working with stereo, signed 16-bit PCM WAV files at 44.1 kHz, follow these steps:

Resample the audio data to 16 kHz.
Downmix from stereo to mono by averaging the two channels.
Convert from int16 to float32 and divide by 32768.0 to scale to the range [-1, 1].

Note: when resampling audio to 16 kHz, use a Fourier method for best results, such as scipy.signal.resample or librosa.resample(res_type='scipy').
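
The three steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the preprocessing the model's processor actually performs: the function name pcm16_to_model_waveform is hypothetical, the input is assumed to be already decoded int16 samples, and the Fourier resampling is written directly with NumPy's FFT as a stand-in for a library routine such as scipy.signal.resample. The downmix is done first so that only one channel is resampled; since each step is linear, the order should not change the result beyond rounding.

```python
import numpy as np

TARGET_SAMPLE_RATE = 16000  # Hz, from the technical encoding listed earlier


def pcm16_to_model_waveform(samples, sample_rate):
    """Convert int16 PCM shaped (frames,) or (frames, channels) into
    mono float32 at 16 kHz with values in [-1, 1]."""
    audio = np.asarray(samples, dtype=np.float64)
    # Downmix to mono by averaging the channels.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    # Scale int16 values into [-1, 1].
    audio = audio / 32768.0
    # Fourier-domain resample: crop or zero-pad the spectrum, then invert.
    if sample_rate != TARGET_SAMPLE_RATE:
        n_in = audio.shape[0]
        n_out = int(round(n_in * TARGET_SAMPLE_RATE / sample_rate))
        spectrum = np.fft.rfft(audio)
        audio = np.fft.irfft(spectrum, n=n_out) * (n_out / n_in)
    # Resampling can overshoot slightly, so clip back into range.
    return np.clip(audio, -1.0, 1.0).astype(np.float32)


# Usage: one second of a stereo 440 Hz tone at 44.1 kHz, stored as int16.
t = np.arange(44100) / 44100
tone = np.round(0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
stereo = np.stack([tone, tone], axis=1)
waveform = pcm16_to_model_waveform(stereo, 44100)
print(waveform.shape, waveform.dtype)  # (16000,) float32
```

FFT-based resampling treats the clip as periodic, so very short clips or clips with abrupt edges may show small artifacts near the start and end.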

Speech-to-text

The Gemma 4 E2B, E4B, and 12B Unified models are trained for multilingual speech recognition, letting you transcribe audio input in multiple languages into text.

Use the following prompt structure for automatic speech recognition (ASR).

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
*   Only output the transcription, with no newlines.
*   When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
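
One way to fill in the {LANGUAGE} placeholder is a small helper that formats the template and wraps it in a chat message. This is a sketch only: build_asr_messages is a hypothetical name, and the message layout simply mirrors the "text" and "audio" content entries used in this guide's examples.

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)


def build_asr_messages(language, audio):
    """Return a single-turn message list for a transcription request.

    `audio` is a file path or URL, passed through unchanged.
    """
    prompt = ASR_TEMPLATE.format(language=language)
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio},
            ],
        }
    ]
```

Keeping the template in one constant makes it easier to reuse the same wording across clips and languages.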

The following code examples show how to prompt the model to transcribe text from audio files using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            #{"type": "text", "text": "Transcribe the following speech segment in English into English text. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

I woke up early today feeling really fresh the morning light was beautiful and I enjoyed a nice cup of coffee<turn|>

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 1024
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Give me a concise overview of these audio files."},
            {"type": "text", "text": "journal1:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
            {"type": "text", "text": "journal2:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal2.wav"},
            {"type": "text", "text": "journal3:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal3.wav"},
            {"type": "text", "text": "journal4:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal4.wav"},
            {"type": "text", "text": "journal5:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal5.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Here is a concise overview of each audio file:

**journal1:** The speaker describes a fresh and peaceful day, enjoying a cup of coffee.
**journal2:** The speaker had a perfect day at the park, including a walk and watching cherry blossoms.
**journal3:** The speaker finished the day with a good book, feeling grateful for simple moments.
**journal4:** The speaker returned from work and noted the beautiful night sky and a clear view from the train.
**journal5:** The speaker had a great lunch with an old friend, which was a pleasant way to catch up and made their day.
<turn|>

Automatic speech translation

The Gemma 4 E2B, E4B, and 12B Unified models are trained for multilingual speech translation tasks, letting you translate spoken audio directly into another language.

Use the following prompt structure for automatic speech translation (AST).

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
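
Because the prompt asks for a fixed layout (transcription, newline, then a '{TARGET_LANGUAGE}: ' label), the response can be split back into its two parts. The sketch below is illustrative: build_ast_prompt and parse_ast_output are hypothetical helpers, and stripping "<turn|>" is an assumption based on the trailing marker visible in this guide's sample outputs, which may differ or be absent with other decoding settings.

```python
AST_TEMPLATE = (
    "Transcribe the following speech segment in {source}, then translate it "
    "into {target}.\n"
    "When formatting the answer, first output the transcription in {source}, "
    "then one newline, then output the string '{target}: ', then the "
    "translation in {target}."
)

# Trailing marker seen in this guide's sample outputs (assumption).
END_OF_TURN = "<turn|>"


def build_ast_prompt(source, target):
    """Fill the AST template with source and target language names."""
    return AST_TEMPLATE.format(source=source, target=target)


def parse_ast_output(text, target):
    """Split a response into (transcription, translation).

    Returns (text, None) when the '<target>:' label is not found, since
    the model is not guaranteed to follow the requested layout.
    """
    text = text.replace(END_OF_TURN, "").strip()
    transcription, sep, translation = text.partition(f"\n{target}:")
    if not sep:
        return text, None
    return transcription.strip(), translation.strip()
```

Returning None for a missing label lets calling code decide whether to retry or fall back to the raw text.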

The following code examples show how to prompt the model to translate spoken audio into text using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Roses are red, violets are blue.
Korean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>

Automatic speech translation / Automatic speech recognition

Try it yourself

pip install ipywebrtc

Press the circle button and start speaking. Click the circle button again when you are done. The widget immediately starts playing back what was captured.

from google.colab import output
output.enable_custom_widget_manager()

from ipywebrtc import AudioRecorder, CameraStream

camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

Convert the webm file to the wav format, which PyTorch can read.

with open('/content/recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i /content/recording.webm /content/recording.wav -y

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, matroska,webm, from '/content/recording.webm':
  Metadata:
    encoder         : Chrome
  Duration: 00:00:03.00, start: 0.000000, bitrate: 132 kb/s
  Stream #0:0(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Stream mapping:
  Stream #0:0 -> #0:0 (opus (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/content/recording.wav':
  Metadata:
    ISFT            : Lavf58.76.100
  Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
    Metadata:
      encoder         : Lavc58.134.100 pcm_s16le
size=     287kB time=00:00:02.99 bitrate= 783.7kbits/s speed=79.4x    
video:0kB audio:287kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.026552%
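
The log above shows the recording captured at 48000 Hz and written to WAV at the same rate, while this guide lists 16 kHz mono as the target encoding. The guide does not say whether the pipeline resamples file inputs on its own, so the following is only a sketch for writing the WAV at the target rate directly. The helper names are hypothetical; -ac and -ar are ffmpeg's standard options for channel count and audio sample rate.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000, channels=1):
    """Build an ffmpeg argument list that converts src to a WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file
        "-ac", str(channels),     # number of output audio channels
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst,
    ]


def convert_recording(src, dst):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Calling convert_recording('/content/recording.webm', '/content/recording.wav') would then produce a mono 16 kHz file, assuming ffmpeg is on the PATH as it is in the notebook environment shown above.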

ASR

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?<turn|>

AST

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?
Korean: 역에 어떻게 가나요?<turn|>

Summary and next steps

In this guide, you learned how to process audio using the Gemma 4 models. The examples demonstrated how to perform speech-to-text (ASR) to transcribe spoken language, as well as automatic speech translation (AST) to translate spoken audio directly into another language. You also saw how to capture audio from a microphone in a notebook environment for processing.

See the following documentation for more information.