Audio understanding

Starting with Gemma 3n, you can use audio directly in your prompts and workflows. Audio and spoken language are rich sources of data: they capture user intent, record information about the world around us, and describe the specific problems to be solved.

This guide gives an overview of the audio processing capabilities of Gemma 4, including automatic speech recognition (ASR), speech translation, and general speech understanding.

This notebook runs on a T4 GPU.

Install Python packages

Install the Hugging Face libraries required to run the Gemma model and make requests.

# Install PyTorch & other libraries
pip install torch accelerate

# Install the transformers library
pip install "transformers>=5.10.1"

Load the model

Use the transformers library to create the processor and model instances. The following code example does this through the pipeline API instead of instantiating the AutoProcessor and AutoModelForImageTextToText classes directly:

MODEL_ID = "google/gemma-4-E2B-it" # @param ["google/gemma-4-E2B-it","google/gemma-4-E4B-it", "google/gemma-4-12B-it"]

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto"
)

Audio data

Digital audio data comes in many formats and resolution levels. The audio formats you can use with Gemma, such as MP3 and WAV, are determined by the framework you choose for converting sound data into tensors. Here are some specific considerations for preparing audio data for processing with Gemma:

Token cost: Gemma 4 uses 25 tokens per second of audio (Gemma 3n uses 6.25 tokens per second).
Clip length: Audio clips can be up to 30 seconds long.
Audio channels: Audio data is processed as a single audio channel. If you are using multi-channel audio, such as left and right channels, reduce the data to a single channel by dropping channels or by mixing the sound data down into one channel.
Technical encoding:
- Sample rate: 16 kHz
- Bit depth: 32-bit float format, with samples normalized to the range [-1, 1].

If the audio data you want to process differs significantly from what the input processing expects, particularly in channels, sample rate, or bit depth, resample or trim your audio data so that it matches the data resolution the model handles.
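The token cost and clip length figures above can be turned into a quick budget check. The following is a minimal sketch, not part of the Gemma tooling: the function names are hypothetical, and rounding the token count up to a whole number is an assumption.

```python
import math

# Rates and limits as stated in this guide.
AUDIO_TOKENS_PER_SECOND = {"gemma-4": 25.0, "gemma-3n": 6.25}
MAX_CLIP_SECONDS = 30.0
SAMPLE_RATE = 16_000


def audio_token_cost(duration_seconds: float, model_family: str = "gemma-4") -> int:
    """Estimate the tokens used by a clip (rounding up is an assumption)."""
    rate = AUDIO_TOKENS_PER_SECOND[model_family]
    return math.ceil(duration_seconds * rate)


def clip_boundaries(num_samples: int, sample_rate: int = SAMPLE_RATE) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive clips of at most 30 s."""
    max_len = int(MAX_CLIP_SECONDS * sample_rate)
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


if __name__ == "__main__":
    print(audio_token_cost(30.0))             # tokens for a full 30 s clip on Gemma 4
    print(clip_boundaries(70 * SAMPLE_RATE))  # a 70 s recording split into three clips
```

Splitting on fixed 30-second boundaries can cut a word in half; splitting at pauses would be gentler, but that is beyond this sketch.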

Audio encoding

High-level libraries, such as the Hugging Face AutoProcessor, usually handle audio preprocessing automatically. However, you may sometimes need to implement custom encoding.

When you encode audio data for Gemma in your own code, follow the recommended conversion process. If you are working with audio files in an encoded format, such as MP3 or WAV, first decode them into samples using a library such as ffmpeg. Once the data is decoded, convert the audio into a mono-channel, 16 kHz, float32 waveform in the range [-1, 1]. For example, if you are working with stereo, signed 16-bit PCM integer WAV files at 44.1 kHz, follow these steps:

Resample the audio data to 16 kHz
Downmix from stereo to mono by averaging the two channels
Convert from int16 to float32 and divide by 32768.0 to scale to the range [-1, 1]

Note: When resampling audio to 16 kHz, use a Fourier method for better results, such as scipy.signal.resample or librosa.resample(res_type='scipy').
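The three steps above can be sketched with NumPy and SciPy. This is an illustrative sketch rather than the preprocessing that the Hugging Face processor performs: the function name `pcm16_to_model_input` is hypothetical, the samples are assumed to be already decoded into an int16 array of shape (frames, channels), and the final clipping step is an added safeguard because Fourier resampling can overshoot slightly.

```python
import numpy as np
from scipy.signal import resample

TARGET_RATE = 16_000  # sample rate stated in this guide


def pcm16_to_model_input(samples: np.ndarray, source_rate: int) -> np.ndarray:
    """Convert decoded int16 PCM (frames, channels) to mono float32 at 16 kHz."""
    # Work in floating point; the FFT-based resampler returns floats anyway.
    audio = samples.astype(np.float32)
    if audio.ndim == 1:
        audio = audio[:, None]  # treat mono input as a single channel

    # Step 1: Fourier-method resampling along the time axis.
    num_out = int(round(audio.shape[0] * TARGET_RATE / source_rate))
    audio = resample(audio, num_out, axis=0)

    # Step 2: downmix to mono by averaging the channels.
    mono = audio.mean(axis=1)

    # Step 3: scale the int16 range to [-1, 1].
    mono = mono / 32768.0

    # Assumption: clip any small overshoot introduced by resampling.
    return np.clip(mono, -1.0, 1.0).astype(np.float32)


if __name__ == "__main__":
    # One second of a 440 Hz stereo tone at 44.1 kHz as stand-in input.
    t = np.arange(44_100) / 44_100
    tone = (np.sin(2 * np.pi * 440 * t) * 20_000).astype(np.int16)
    stereo = np.stack([tone, tone], axis=1)
    waveform = pcm16_to_model_input(stereo, 44_100)
    print(waveform.shape, waveform.dtype)
```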

Speech-to-text

Gemma 4 E2B, E4B, and 12B Unified are trained to recognize speech in multiple languages, so you can transcribe audio input in different languages into text.

For automatic speech recognition (ASR), use the following prompt structure.

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
*   Only output the transcription, with no newlines.
*   When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
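One way to fill in the {LANGUAGE} placeholder is a small formatting helper. This is a minimal sketch; the names `ASR_PROMPT_TEMPLATE` and `build_asr_prompt` are hypothetical and not part of any library.

```python
# The prompt structure from this guide, with {language} as the placeholder.
ASR_PROMPT_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)


def build_asr_prompt(language: str) -> str:
    """Return the ASR prompt for the given spoken language."""
    return ASR_PROMPT_TEMPLATE.format(language=language)


if __name__ == "__main__":
    print(build_asr_prompt("English"))
```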

The following code examples show how to prompt the model to transcribe text from audio files using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            #{"type": "text", "text": "Transcribe the following speech segment in English into English text. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

I woke up early today feeling really fresh the morning light was beautiful and I enjoyed a nice cup of coffee<turn|>

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 1024
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Give me a concise overview of these audio files."},
            {"type": "text", "text": "journal1:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
            {"type": "text", "text": "journal2:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal2.wav"},
            {"type": "text", "text": "journal3:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal3.wav"},
            {"type": "text", "text": "journal4:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal4.wav"},
            {"type": "text", "text": "journal5:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal5.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Here is a concise overview of each audio file:

**journal1:** The speaker describes a fresh and peaceful day, enjoying a cup of coffee.
**journal2:** The speaker had a perfect day at the park, including a walk and watching cherry blossoms.
**journal3:** The speaker finished the day with a good book, feeling grateful for simple moments.
**journal4:** The speaker returned from work and noted the beautiful night sky and a clear view from the train.
**journal5:** The speaker had a great lunch with an old friend, which was a pleasant way to catch up and made their day.
<turn|>
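The interleaved label and audio entries in the request above follow a regular pattern, so the content list can also be generated. The sketch below assumes the same message layout as the example; the helper name `build_labeled_audio_content` and the `extension` parameter are hypothetical.

```python
def build_labeled_audio_content(instruction, url_prefix, names, extension=".wav"):
    """Build a content list: one instruction, then a text label before each audio clip."""
    content = [{"type": "text", "text": instruction}]
    for name in names:
        content.append({"type": "text", "text": f"{name}:"})
        content.append({"type": "audio", "audio": f"{url_prefix}{name}{extension}"})
    return content


if __name__ == "__main__":
    prefix = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/"
    names = [f"journal{i}" for i in range(1, 6)]
    messages = [{
        "role": "user",
        "content": build_labeled_audio_content(
            "Give me a concise overview of these audio files.", prefix, names
        ),
    }]
    print(len(messages[0]["content"]))  # one instruction plus two entries per clip
```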

Automatic speech translation

Gemma 4 E2B, E4B, and 12B Unified are trained to translate speech across multiple languages, so you can translate audio content directly into another language.

For automatic speech translation (AST), use the following prompt structure.

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
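The two placeholders can be filled in with a small helper, sketched below. The names `AST_PROMPT_TEMPLATE` and `build_ast_prompt` are hypothetical, and joining the two sentences with a single space (rather than a line break) is a choice made here for a single-line prompt string.

```python
# The prompt structure from this guide, with the two sentences joined by a space.
AST_PROMPT_TEMPLATE = (
    "Transcribe the following speech segment in {source}, then translate it into {target}. "
    "When formatting the answer, first output the transcription in {source}, "
    "then one newline, then output the string '{target}: ', "
    "then the translation in {target}."
)


def build_ast_prompt(source_language: str, target_language: str) -> str:
    """Return the AST prompt for a source and target language pair."""
    return AST_PROMPT_TEMPLATE.format(source=source_language, target=target_language)


if __name__ == "__main__":
    print(build_ast_prompt("English", "Korean"))
```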

The following code examples show how to prompt the model to translate audio content into text using Hugging Face Transformers:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
        ]
    }
]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

Roses are red, violets are blue.
Korean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>
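Because the prompt asks for a fixed layout (transcription, a newline, then the target language name followed by the translation), the response can be split back into its two parts. The following is a rough sketch: `parse_ast_output` is a hypothetical helper, and stripping a trailing `<turn|>` marker is an assumption based only on the sample output shown above.

```python
def parse_ast_output(generated_text, target_language, end_marker="<turn|>"):
    """Split an AST response into (transcription, translation).

    Returns (text, None) if the expected '<target>: ' separator is missing.
    """
    text = generated_text.strip()
    # Assumption: drop the end-of-turn marker seen in the sample output.
    if text.endswith(end_marker):
        text = text[: -len(end_marker)].rstrip()

    separator = f"\n{target_language}: "
    transcription, found, translation = text.partition(separator)
    if not found:
        return text, None
    return transcription.strip(), translation.strip()


if __name__ == "__main__":
    sample = "Roses are red, violets are blue.\nKorean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>"
    print(parse_ast_output(sample, "Korean"))
```

Model output is not guaranteed to follow the requested layout, which is why the helper returns None for the translation instead of raising an error.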

Automatic speech translation / automatic speech recognition

Try it yourself

pip install ipywebrtc

Press the circle button and start speaking. When you are finished, click the circle button again. The widget immediately starts playing back what it captured.

from google.colab import output
output.enable_custom_widget_manager()

from ipywebrtc import AudioRecorder, CameraStream

camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

Convert the webm file to wav format, which PyTorch can read.

with open('/content/recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i /content/recording.webm /content/recording.wav -y

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, matroska,webm, from '/content/recording.webm':
  Metadata:
    encoder         : Chrome
  Duration: 00:00:03.00, start: 0.000000, bitrate: 132 kb/s
  Stream #0:0(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Stream mapping:
  Stream #0:0 -> #0:0 (opus (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/content/recording.wav':
  Metadata:
    ISFT            : Lavf58.76.100
  Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
    Metadata:
      encoder         : Lavc58.134.100 pcm_s16le
size=     287kB time=00:00:02.99 bitrate= 783.7kbits/s speed=79.4x    
video:0kB audio:287kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.026552%

ASR

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 64
gen_kwargs = dict(generation_config=config)

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?<turn|>

AST

messages = [{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
    {"type": "audio", "audio": "/content/recording.wav"},
  ]
}]

outputs = pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)
print(outputs[0]['generated_text'])

How can I get to the station?
Korean: 역에 어떻게 가나요?<turn|>

Summary and next steps

This guide showed how to process audio with Gemma 4 models. The examples demonstrated how to use automatic speech recognition (ASR) to transcribe spoken language, and how to use automatic speech translation (AST) to translate audio content directly into another language. You also saw how to capture audio from a microphone in a notebook environment for processing.

For more information, see this documentation.