Run in Google Colab | View source on GitHub
Starting with Gemma 3n, you can use audio directly in your prompts and workflows. Audio and spoken language are rich sources of information for understanding user intent, gathering information about the world around us, and understanding the specific problems to be solved.
This guide gives a brief overview of the audio processing capabilities of Gemma 4, including automatic speech recognition (ASR), translation, and general speech understanding.
This notebook runs on a T4 GPU.
Install Python packages
Install the Hugging Face libraries required to run the Gemma model and send requests to it.
# Install PyTorch & other libraries
pip install torch accelerate
# Install the transformers library
pip install transformers
Load model
Use the transformers library to create a processor and a model instance with the AutoProcessor and AutoModelForMultimodalLM classes, as shown in the following code example:
MODEL_ID = "google/gemma-4-E2B-it" # @param ["google/gemma-4-E2B-it","google/gemma-4-E4B-it", "google/gemma-4-31B-it", "google/gemma-4-26B-A4B-it"]
from transformers import AutoProcessor, AutoModelForMultimodalLM
model = AutoModelForMultimodalLM.from_pretrained(MODEL_ID, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
Audio data
Digital audio data comes in a variety of formats and resolutions. Which audio formats you can use with Gemma, such as MP3 and WAV, depends on the framework you choose to convert sound data into tensors. Here are some specific considerations when preparing audio data for processing with Gemma:
- Token cost: Gemma 4 uses 25 tokens per second of audio (Gemma 3n uses 6.25 tokens per second).
- Clip length: Audio clips of up to 30 seconds are supported.
- Audio channels: Audio data is processed as a single audio channel. If you are working with multi-channel audio, such as left and right channels, consider reducing the data to a single channel by dropping channels or by combining the sound data into one channel.
- Technical encoding:
  - Sample rate: 16 kHz, using 32 ms frames.
  - Bit depth: 32-bit float, with samples normalized to the range [-1, 1].
If the audio data you plan to process differs significantly from this input processing, particularly in channels, sample rate, or bit depth, consider resampling or trimming your audio data to match the data resolution handled by the model.
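To make the token cost and clip length figures concrete, the following is a minimal sketch that splits a long recording into clips of at most 30 seconds and estimates the audio token count of each clip. The function names are hypothetical, rounding the estimate up is an assumption, and the fixed-length split ignores word boundaries, so a real pipeline would probably cut at pauses instead.

```python
import math

GEMMA4_TOKENS_PER_SECOND = 25.0   # figure quoted in the list above
MAX_CLIP_SECONDS = 30.0           # clip length limit quoted in the list above
SAMPLE_RATE_HZ = 16_000           # sample rate quoted in the list above


def estimate_audio_tokens(duration_s, tokens_per_second=GEMMA4_TOKENS_PER_SECOND):
    """Rough token estimate for a clip; rounding up is an assumption."""
    return math.ceil(duration_s * tokens_per_second)


def split_into_clips(num_samples, sample_rate=SAMPLE_RATE_HZ, max_seconds=MAX_CLIP_SECONDS):
    """Return (start, end) sample index pairs, each spanning at most max_seconds."""
    max_len = int(sample_rate * max_seconds)
    return [(start, min(start + max_len, num_samples))
            for start in range(0, num_samples, max_len)]


# A 70-second recording at 16 kHz splits into clips of 30 s, 30 s, and 10 s.
clips = split_into_clips(70 * SAMPLE_RATE_HZ)
for start, end in clips:
    seconds = (end - start) / SAMPLE_RATE_HZ
    print(f"{seconds:.0f} s -> about {estimate_audio_tokens(seconds)} tokens")
```

Passing tokens_per_second=6.25 gives the corresponding estimate for Gemma 3n.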
Audio encoding
Although high-level libraries (such as the Hugging Face AutoProcessor) often handle audio preprocessing automatically, you may sometimes need to apply custom encoding.
When you encode audio data for Gemma with your own code, follow the recommended conversion process. If you are working with audio files encoded in a specific format, such as MP3 or WAV, first decode them into samples using a library such as ffmpeg. Once the data is decoded, convert the audio into a mono-channel, 16 kHz float32 waveform in the range [-1, 1]. For example, if you are working with a stereo, signed 16-bit PCM integer WAV file at 44.1 kHz, follow these steps:
- Resample the audio data to 16 kHz.
- Downmix from stereo to mono by averaging the two channels.
- Convert from int16 to float32 and divide by 32768.0 to scale to the range [-1, 1].
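The three steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the preprocessing used by the processor: the function name to_gemma_waveform is hypothetical, the input is assumed to be already decoded into an int16 array of shape (num_samples, num_channels), and the resampler is plain linear interpolation without an anti-aliasing filter. For real recordings, a filtered resampler such as scipy.signal.resample_poly is likely a better choice.

```python
import numpy as np

TARGET_RATE_HZ = 16_000


def to_gemma_waveform(pcm, source_rate):
    """Convert int16 PCM of shape (num_samples, num_channels) to mono float32 at 16 kHz."""
    pcm = np.asarray(pcm)
    if pcm.ndim == 1:
        pcm = pcm[:, None]  # treat a 1-D array as a single channel

    # Step 1: resample each channel to 16 kHz with linear interpolation.
    num_in = pcm.shape[0]
    num_out = int(round(num_in * TARGET_RATE_HZ / source_rate))
    t_in = np.arange(num_in) / source_rate
    t_out = np.arange(num_out) / TARGET_RATE_HZ
    resampled = np.stack(
        [np.interp(t_out, t_in, pcm[:, ch]) for ch in range(pcm.shape[1])],
        axis=1,
    )

    # Step 2: downmix to mono by averaging the channels.
    mono = resampled.mean(axis=1)

    # Step 3: scale the int16 range to [-1, 1] and store as float32.
    return (mono / 32768.0).astype(np.float32)


# Synthetic input: one second of a 440 Hz tone, stereo, int16, 44.1 kHz.
t = np.arange(44_100) / 44_100
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
stereo = np.stack([tone, tone], axis=1)
waveform = to_gemma_waveform(stereo, source_rate=44_100)
print(waveform.shape, waveform.dtype)
```

The steps follow the order given in the list. Because downmixing and scaling are linear, they could also run before resampling, which halves the resampling work for stereo input.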
Speech to text
Gemma 4 E2B and E4B are trained for multilingual speech recognition, which lets you convert audio input in a variety of languages into text. The following code examples show how to prompt the model to transcribe audio files into text using Hugging Face Transformers:
RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            #{"type": "text", "text": "Transcribe the following speech segment in English into English text. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
        ]
    }
]
# Apply the chat template and tokenize the text and audio inputs.
input_ids = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt",
)
input_ids = input_ids.to(model.device, dtype=model.dtype)
outputs = model.generate(**input_ids, max_new_tokens=64)
# Decode with special tokens kept so the turn and audio markers stay visible.
text = processor.batch_decode(
    outputs,
    skip_special_tokens=False,
    clean_up_tokenization_spaces=False
)
print(text[0])
<bos><|turn>user Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer: * Only output the transcription, with no newlines. * When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.<|audio>[... repeated <|audio|> placeholder tokens omitted ...]<audio|><turn|> <|turn>model I woke up early today feeling really fresh the morning light was beautiful and I enjoyed a nice cup of coffee<turn|>
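Because the decoded string keeps the prompt, the audio placeholders, and the turn markers, you may want only the model's reply. The following is a small sketch that pulls the reply out with a regular expression. The helper name extract_model_reply is hypothetical, and the marker strings are taken from the decoded output above rather than from a documented specification. Another common approach is to decode only the newly generated token ids with skip_special_tokens=True.

```python
import re

# Turn markers as they appear in the decoded output above.
_MODEL_TURN = re.compile(r"<\|turn>model\s*(.*?)\s*(?:<turn\|>|$)", re.DOTALL)


def extract_model_reply(decoded):
    """Return the text of the last model turn, or an empty string if none is found."""
    matches = _MODEL_TURN.findall(decoded)
    return matches[-1].strip() if matches else ""


sample = (
    "<bos><|turn>user Transcribe the following speech segment."
    "<|audio><|audio|><audio|><turn|> "
    "<|turn>model I woke up early today feeling really fresh<turn|>"
)
print(extract_model_reply(sample))
```

The pattern also accepts a reply with no closing marker, which can happen when generation stops at max_new_tokens.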
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Give me a concise overview of these audio files."},
            # Each audio clip is preceded by a text label so the model can refer to it by name.
            {"type": "text", "text": "journal1:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal1.wav"},
            {"type": "text", "text": "journal2:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal2.wav"},
            {"type": "text", "text": "journal3:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal3.wav"},
            {"type": "text", "text": "journal4:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal4.wav"},
            {"type": "text", "text": "journal5:"},
            {"type": "audio", "audio": f"{RESOURCE_URL_PREFIX}journal5.wav"},
        ]
    }
]
input_ids = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt",
)
input_ids = input_ids.to(model.device, dtype=model.dtype)
outputs = model.generate(**input_ids, max_new_tokens=1024)
text = processor.batch_decode(
    outputs,
    skip_special_tokens=False,
    clean_up_tokenization_spaces=False
)
print(text[0])
<bos><|turn>user Give me a concise overview of these audio files.journal1:<|audio>[... repeated <|audio|> placeholder tokens omitted ...]<audio|>journal2:<|audio>[... repeated <|audio|> placeholder tokens omitted ...]<audio|>journal3:<|audio>[... repeated <|audio|> placeholder tokens omitted ...]<audio|>journal4:<|audio>[... repeated <|audio|> placeholder tokens omitted ...]<audio|>journal5:<|audio>[... repeated <|audio|> placeholder tokens omitted ...]<audio|><turn|> <|turn>model Here is a concise overview of the audio files: **Journal 1:** The speaker felt refreshed, enjoyed a morning ride, a cup of coffee, and was generally happy. **Journal 2:** The speaker spent the afternoon at the park, which was a perfect day for a walk, and enjoyed watching the cherry blossoms. **Journal 3:** The speaker finished the day with a good book, feeling grateful for simple moments and ready for more. **Journal 4:** The speaker returned from work, admiring the sunset, and enjoyed a clear view from the train. **Journal 5:** The speaker had a great lunch with an old friend, enjoyed catching up, and felt happy about the day.<turn|>
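When a prompt contains many clips, the alternating label and audio entries can be generated instead of written by hand. The sketch below rebuilds the same content list as the example above. The helper name build_multi_audio_content and its parameters are hypothetical, and it assumes every clip shares one URL prefix and file extension.

```python
def build_multi_audio_content(instruction, clip_names, url_prefix, extension=".wav"):
    """Interleave a text label and an audio entry for each named clip."""
    content = [{"type": "text", "text": instruction}]
    for name in clip_names:
        content.append({"type": "text", "text": f"{name}:"})
        content.append({"type": "audio", "audio": f"{url_prefix}{name}{extension}"})
    return content


RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/"
clip_names = [f"journal{i}" for i in range(1, 6)]
messages = [{
    "role": "user",
    "content": build_multi_audio_content(
        "Give me a concise overview of these audio files.",
        clip_names,
        RESOURCE_URL_PREFIX,
    ),
}]
print(len(messages[0]["content"]))
```

Each clip still counts toward the prompt's token budget, so the number of clips you can combine depends on their lengths.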
Automatic speech translation
Gemma 4 E2B and E4B are trained for multilingual speech translation tasks, which lets you translate spoken audio directly into another language. The following code examples show how to prompt the model to translate spoken audio into text using Hugging Face Transformers:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
        ]
    }
]
input_ids = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt",
)
input_ids = input_ids.to(model.device, dtype=model.dtype)
outputs = model.generate(**input_ids, max_new_tokens=64)
text = processor.batch_decode(
    outputs,
    skip_special_tokens=False,
    clean_up_tokenization_spaces=False
)
print(text[0])
<bos><|turn>user Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean.<|audio>[... repeated <|audio|> placeholder tokens omitted ...]<audio|><turn|> <|turn>model Roses are red, violets are blue. Korean: 장미는 빨갛고, 제비꽃은 파랗다.<turn|>
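The translation prompt above follows a fixed pattern, so it can be parameterized by language, and the reply can be split at the language marker the prompt asks for. The sketch below shows one way to do both. The function names are hypothetical, and the split assumes the model followed the requested format and that the marker does not also occur inside the transcription.

```python
def build_translation_prompt(source_language, target_language):
    """Build a transcribe-then-translate instruction following the example prompt."""
    return (
        f"Transcribe the following speech segment in {source_language}, "
        f"then translate it into {target_language}. When formatting the answer, "
        f"first output the transcription in {source_language}, then one newline, "
        f"then output the string '{target_language}: ', "
        f"then the translation in {target_language}."
    )


def split_translation(reply, target_language):
    """Split a reply into (transcription, translation) at the '<language>:' marker."""
    marker = f"{target_language}:"
    transcription, found, translation = reply.partition(marker)
    if not found:
        return reply.strip(), ""
    return transcription.strip(), translation.strip()


reply = "Roses are red, violets are blue. Korean: 장미는 빨갛고, 제비꽃은 파랗다."
print(build_translation_prompt("English", "Korean"))
print(split_translation(reply, "Korean"))
```

If the marker is missing, split_translation returns the whole reply as the transcription and an empty translation, which makes format failures easy to detect.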
Automatic speech translation / automatic speech recognition
Try it yourself
pip install ipywebrtc
Press the circular button and start speaking. When you have finished, click the circular button again. The widget immediately starts playing back what it recorded.
from google.colab import output
output.enable_custom_widget_manager()
from ipywebrtc import AudioRecorder, CameraStream
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder
AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …
Convert the webm file to the WAV format, which PyTorch can read.
# Save the recorded bytes to disk, then convert them with ffmpeg.
with open('/content/recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i /content/recording.webm /content/recording.wav -y
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil 56. 70.100 / 56. 70.100
libavcodec 58.134.100 / 58.134.100
libavformat 58. 76.100 / 58. 76.100
libavdevice 58. 13.100 / 58. 13.100
libavfilter 7.110.100 / 7.110.100
libswscale 5. 9.100 / 5. 9.100
libswresample 3. 9.100 / 3. 9.100
libpostproc 55. 9.100 / 55. 9.100
Input #0, matroska,webm, from '/content/recording.webm':
Metadata:
encoder : Chrome
Duration: 00:00:04.02, start: 0.000000, bitrate: 131 kb/s
Stream #0:0(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Stream mapping:
Stream #0:0 -> #0:0 (opus (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/content/recording.wav':
Metadata:
ISFT : Lavf58.76.100
Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
Metadata:
encoder : Lavc58.134.100 pcm_s16le
size= 383kB time=00:00:04.01 bitrate= 779.7kbits/s speed=60.6x
video:0kB audio:382kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.019914%
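The log above shows that the converted WAV file keeps the recording's 48000 Hz sample rate. This guide does not say whether the processor resamples such input itself, so as a precaution you could ask ffmpeg for 16 kHz mono output directly. The following is a sketch under that assumption, run from Python instead of a notebook shell command. The helper names are hypothetical, and -ar and -ac are the standard ffmpeg options for output sample rate and channel count.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16_000, channels=1):
    """Return an ffmpeg argument list that converts src to dst with the given audio layout."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", str(channels),     # number of output channels
        dst,
    ]


def convert_audio(src, dst):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True, capture_output=True)


print(" ".join(build_ffmpeg_command("/content/recording.webm", "/content/recording.wav")))
```

Calling convert_audio("/content/recording.webm", "/content/recording.wav") would then replace the shell command, provided ffmpeg is installed in the runtime.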
ASR
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
        # Local path of the recording converted in the previous step.
        {"type": "audio", "audio": "/content/recording.wav"},
    ]
}]
input_ids = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt",
)
input_ids = input_ids.to(model.device, dtype=model.dtype)
outputs = model.generate(**input_ids, max_new_tokens=64)
text = processor.batch_decode(
    outputs,
    skip_special_tokens=False,
    clean_up_tokenization_spaces=False
)
print(text[0])
<bos><|turn>user Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer: * Only output the transcription, with no newlines. * When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.<|audio>[... repeated <|audio|> placeholder tokens omitted ...]<audio|><turn|> <|turn>model How can I get to the station?<turn|>
AST
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean."},
        # Local path of the recording converted in the previous step.
        {"type": "audio", "audio": "/content/recording.wav"},
    ]
}]
input_ids = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True, return_dict=True,
    return_tensors="pt",
)
input_ids = input_ids.to(model.device, dtype=model.dtype)
outputs = model.generate(**input_ids, max_new_tokens=64)
text = processor.batch_decode(
    outputs,
    skip_special_tokens=False,
    clean_up_tokenization_spaces=False
)
print(text[0])
<bos><|turn>user Transcribe the following speech segment in English, then translate it into Korean. When formatting the answer, first output the transcription in English, then one newline, then output the string 'Korean: ', then the translation in Korean.<|audio>[... repeated <|audio|> placeholder tokens omitted ...]<audio|><turn|> <|turn>model How can I get to the station? Korean: 역에 어떻게 가나요?<turn|>
Summary and next steps
In this guide, you learned how to process audio with Gemma 4 models. The examples showed how to use speech-to-text (ASR) to transcribe spoken language and automatic speech translation (AST) to translate spoken audio directly into another language. You also saw how to capture audio from a microphone in a notebook environment for processing.
See the following documentation for more details.