Fine-tune EmbeddingGemma

Fine-tuning helps close the gap between a model's general-purpose understanding and the specialized, high-performance accuracy that your application requires. Because no single model is well suited to every task, fine-tuning adapts the model to your specific domain.

Imagine that your company, "Shibuya Financial", offers a variety of complex financial products such as investment trusts, NISA accounts (tax-advantaged savings accounts), and home loans. Your customer support team uses an internal knowledge base to quickly find answers to customer questions.

Setup

Before starting this tutorial, complete the following steps:

Get access to EmbeddingGemma by logging in to Hugging Face and acknowledging the license for a Gemma model.
Generate a Hugging Face access token and use it to log in from Colab.

This notebook runs on either CPU or GPU.

Install Python packages

Install the libraries required to run the EmbeddingGemma model and generate embeddings. Sentence Transformers is a Python framework for text and image embeddings. For more information, see the Sentence Transformers documentation.

pip install -U sentence-transformers git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview

After you have accepted the license, you need a valid Hugging Face token to access the model.

# Log in to the Hugging Face Hub
from huggingface_hub import login
login()

Load the model

Use the sentence-transformers library to create an instance of a model class with EmbeddingGemma.

import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "google/embeddinggemma-300M"
model = SentenceTransformer(model_id).to(device=device)

print(f"Device: {model.device}")
print(model)
print("Total number of parameters in the model:", sum([p.numel() for _, p in model.named_parameters()]))

Device: cuda:0
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
Total number of parameters in the model: 307581696

Prepare the fine-tuning dataset

This is the most important part. You need to create a dataset that teaches the model what "similar" means in your specific context. This data is often structured as triplets: (anchor, positive, negative)

Anchor: The original query or sentence.
Positive: A sentence that is semantically very similar or identical to the anchor.
Negative: A sentence that is on a related topic but has a different meaning.

This example prepares only 3 triplets, but a real application needs a much larger dataset to perform well.

from datasets import Dataset

dataset = [
    ["How do I open a NISA account?", "What is the procedure for starting a new tax-free investment account?", "I want to check the balance of my regular savings account."],
    ["Are there fees for making an early repayment on a home loan?", "If I pay back my house loan early, will there be any costs?", "What is the management fee for this investment trust?"],
    ["What is the coverage for medical insurance?", "Tell me about the benefits of the health insurance plan.", "What is the cancellation policy for my life insurance?"],
]

# Convert the list-based dataset into a list of dictionaries.
data_as_dicts = [ {"anchor": row[0], "positive": row[1], "negative": row[2]} for row in dataset ]

# Create a Hugging Face `Dataset` object from the list of dictionaries.
train_dataset = Dataset.from_list(data_as_dicts)
print(train_dataset)

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 3
})
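Because training quality depends on well-formed triplets, it can help to validate rows before building the `Dataset`. The following is a minimal sketch; the helper name `rows_to_triplets` and its specific checks are illustrative assumptions and are not part of the original notebook.

```python
from typing import Dict, List, Sequence


def rows_to_triplets(rows: Sequence[Sequence[str]]) -> List[Dict[str, str]]:
    """Convert [anchor, positive, negative] rows to dicts, rejecting malformed rows.

    Hypothetical helper for illustration only.
    """
    triplets = []
    for i, row in enumerate(rows):
        if len(row) != 3:
            raise ValueError(f"row {i}: expected 3 fields, got {len(row)}")
        anchor, positive, negative = (text.strip() for text in row)
        if not (anchor and positive and negative):
            raise ValueError(f"row {i}: empty text field")
        if positive == negative:
            raise ValueError(f"row {i}: positive and negative are identical")
        triplets.append({"anchor": anchor, "positive": positive, "negative": negative})
    return triplets


sample_rows = [
    [
        "How do I open a NISA account?",
        "What is the procedure for starting a new tax-free investment account?",
        "I want to check the balance of my regular savings account.",
    ],
]
triplets = rows_to_triplets(sample_rows)
print(triplets[0]["anchor"])
```

The returned list has the same shape as `data_as_dicts`, so it could be passed to `Dataset.from_list` in the same way.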

Before fine-tuning

A search for "tax-free investment" might return the following results, with similarity scores:

Document: Opening a NISA Account (Score: 0.51)
Document: Opening a Regular Savings Account (Score: 0.50) <- Similar score, potentially confusing
Document: Home Loan Application Guide (Score: 0.44)

Note: To generate optimal embeddings with EmbeddingGemma, add an "instruction prompt" or "task" to the beginning of the input text. This tutorial uses the STS prompt for sentence similarity. For details on all available EmbeddingGemma prompts, see the model card.
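To make the note concrete, the sketch below shows what "adding a prompt to the beginning of the input" amounts to: plain string concatenation before tokenization. The prompt string used here is an assumption recalled from the model card; the authoritative values are in `model.prompts` on the loaded model, and `apply_prompt` is a hypothetical helper, not a library function.

```python
# ASSUMPTION: this prompt string may not match the released model exactly.
# Inspect `model.prompts` on the loaded model for the real values.
ASSUMED_PROMPTS = {
    "STS": "task: sentence similarity | query: ",
}


def apply_prompt(text: str, prompt_name: str, prompts=ASSUMED_PROMPTS) -> str:
    """Prepend the named task prompt to the input text (illustrative only)."""
    return prompts[prompt_name] + text


prompted = apply_prompt("Opening a NISA Account", "STS")
print(prompted)
```

In the tutorial code, passing `prompt_name=task_name` to `model.encode()` asks Sentence Transformers to do this prepending for you.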

task_name = "STS"

def get_scores(query, documents):
  # Calculate embeddings by calling model.encode()
  query_embeddings = model.encode(query, prompt_name=task_name)
  doc_embeddings = model.encode(documents, prompt_name=task_name)

  # Calculate the embedding similarities
  similarities = model.similarity(query_embeddings, doc_embeddings)

  for idx, doc in enumerate(documents):
    print("Document: ", doc, "-> 🤖 Score: ", similarities.numpy()[0][idx])

query = "I want to start a tax-free installment investment, what should I do?"
documents = ["Opening a NISA Account", "Opening a Regular Savings Account", "Home Loan Application Guide"]

get_scores(query, documents)

Document:  Opening a NISA Account -> 🤖 Score:  0.51571906
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.5035889
Document:  Home Loan Application Guide -> 🤖 Score:  0.4406476
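The scores above come from `model.similarity`, which in Sentence Transformers defaults to cosine similarity unless the model is configured otherwise. As a rough illustration of what that number means, here is a pure-Python sketch on toy 2-dimensional vectors; the vectors are made up and are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
toy_docs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}
for name, vec in toy_docs.items():
    print(name, cosine_similarity(query_vec, vec))
```

Scores near 1 mean the vectors point the same way, scores near 0 mean they are unrelated, and negative scores mean they point in opposing directions. The closely spaced 0.52 and 0.50 above therefore indicate that the base model barely separates the two account types.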

Training

With a framework such as sentence-transformers in Python, the base model gradually learns the subtle distinctions in your financial vocabulary.
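The training code in this section uses `MultipleNegativesRankingLoss`. As a rough intuition for what that loss optimizes, the sketch below reimplements the idea in pure Python on toy vectors: each anchor is scored against every candidate (all positives in the batch, followed by all negatives) using scaled cosine similarity, and cross-entropy rewards ranking the anchor's own positive first. This is a simplified illustration, not the library's implementation; the function name `mnrl_loss` is hypothetical, and the scale of 20 mirrors what is understood to be the library default.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnrl_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy over anchors; the target for anchor i is positive i."""
    candidates = list(positives) + list(negatives)
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchor = [1.0, 0.0]
near = [0.9, 0.1]   # points almost the same way as the anchor
far = [0.1, 0.9]    # points almost orthogonally to the anchor

good_loss = mnrl_loss([anchor], [near], [far])  # positive is the closer vector
bad_loss = mnrl_loss([anchor], [far], [near])   # positive is the farther vector
print(good_loss, bad_loss)
```

With `per_device_train_batch_size=1`, as in this tutorial, the only candidates for each anchor are its own positive and its own negative. Larger batches would add the other rows' positives and negatives as extra in-batch negatives.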

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from transformers import TrainerCallback

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="my-embedding-gemma",
    # Optional training parameters:
    prompts=model.prompts[task_name],    # use model's prompt to train
    num_train_epochs=5,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    # Optional tracking/debugging parameters:
    logging_steps=train_dataset.num_rows,
    report_to="none",
)

class MyCallback(TrainerCallback):
    """A callback that runs an evaluation function each time the trainer logs."""
    def __init__(self, evaluate):
        self.evaluate = evaluate # evaluate function

    def on_log(self, args, state, control, **kwargs):
        # Re-score the example query against the documents with the current weights
        print(f"Step {state.global_step} finished. Running evaluation:")
        self.evaluate()

def evaluate():
  get_scores(query, documents)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    callbacks=[MyCallback(evaluate)]
)
trainer.train()

Step 3 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.6459116
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.42690125
Document:  Home Loan Application Guide -> 🤖 Score:  0.40419024
Step 6 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.68530923
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.3611964
Document:  Home Loan Application Guide -> 🤖 Score:  0.40812016
Step 9 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.7168733
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.3449782
Document:  Home Loan Application Guide -> 🤖 Score:  0.44477722
Step 12 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73008573
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34124148
Document:  Home Loan Application Guide -> 🤖 Score:  0.4676212
Step 15 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
Step 15 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
TrainOutput(global_step=15, training_loss=0.009651267528511198, metrics={'train_runtime': 195.3004, 'train_samples_per_second': 0.077, 'train_steps_per_second': 0.077, 'total_flos': 0.0, 'train_loss': 0.009651267528511198, 'epoch': 5.0})
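The step numbers in this log follow from the training arguments. The arithmetic below is a small sketch that assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

num_rows = 3                      # train_dataset.num_rows
per_device_train_batch_size = 1
num_train_epochs = 5
logging_steps = num_rows          # as set in the training arguments

# ASSUMPTION: one device and gradient_accumulation_steps == 1.
steps_per_epoch = math.ceil(num_rows / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print(total_steps)
print(logged_steps)
```

Setting `logging_steps` to the number of rows therefore triggers the callback once per epoch. The second "Step 15" entry most likely appears because the trainer logs a final training summary when it finishes, which invokes `on_log` once more.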

After fine-tuning

The same search now yields much clearer results:

Document: Opening a NISA Account (Score: 0.73) <- Much more confident
Document: Opening a Regular Savings Account (Score: 0.34) <- Clearly less relevant
Document: Home Loan Application Guide (Score: 0.47)

get_scores(query, documents)

Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
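One way to quantify "clearer" is the margin between the top-ranked document and the runner-up. The sketch below computes it from the scores printed in this tutorial; `top_margin` is a hypothetical helper introduced only for this illustration.

```python
scores_before = {
    "Opening a NISA Account": 0.51571906,
    "Opening a Regular Savings Account": 0.5035889,
    "Home Loan Application Guide": 0.4406476,
}
scores_after = {
    "Opening a NISA Account": 0.73378766,
    "Opening a Regular Savings Account": 0.34055778,
    "Home Loan Application Guide": 0.47503752,
}


def top_margin(scores):
    """Gap between the best score and the second-best score."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]


print(f"margin before: {top_margin(scores_before):.4f}")
print(f"margin after:  {top_margin(scores_after):.4f}")
```

Note that after fine-tuning the runner-up is the home loan guide rather than the savings account page, so the margin is measured against a different document than before.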

To upload your model to the Hugging Face Hub, you can use the push_to_hub method from the Sentence Transformers library.

Uploading your model makes it easy to access for inference directly from the Hub, to share with others, and to version your work. Once uploaded, anyone can load your model with a single line of code by referencing its unique model ID, <username>/my-embedding-gemma.

# Push to Hub
model.push_to_hub("my-embedding-gemma")
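Loading the uploaded model back might look like the sketch below. The helper names are hypothetical, the username is a placeholder you would replace with your own, and the load itself needs network access plus permission to read the repository, so it is wrapped in a function rather than executed here.

```python
def hub_model_id(username: str, repo_name: str = "my-embedding-gemma") -> str:
    """Build the '<username>/<repo>' ID used to reference a model on the Hub."""
    return f"{username}/{repo_name}"


def load_finetuned(username: str):
    """Download and load the fine-tuned model from the Hub (requires network access)."""
    # Imported lazily so this sketch can be read and checked without the library installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(hub_model_id(username))


# "your-username" is a placeholder, not a real account.
print(hub_model_id("your-username"))
```

Calling `load_finetuned("your-username")` returns a `SentenceTransformer` that can be used with `encode()` and `similarity()` in the same way as the model in this tutorial.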

Summary and next steps

You have now learned how to adapt an EmbeddingGemma model to a specific domain by fine-tuning it with the Sentence Transformers library.

Explore what else you can do with EmbeddingGemma:

Training Overview in the Sentence Transformers documentation
Generate embeddings with Sentence Transformers
Simple RAG example in the Gemma Cookbook