Fine-tuning helps close the gap between a model's general-purpose understanding and the specialized, high-accuracy performance your application requires. Since no single model is perfect for every task, fine-tuning adapts the model to your specific domain.
Imagine your company, "Shibuya Financial", offers various complex financial products: investment trusts, NISA accounts (a tax-advantaged savings account), and home loans. Your customer support team uses an internal knowledge base to quickly find answers to customer questions.
Setup
Before starting this tutorial, complete the following steps:
- Get access to EmbeddingGemma by logging into Hugging Face and selecting Acknowledge license for a Gemma model.
- Generate a Hugging Face Access Token and use it to log in from Colab.
This notebook will run on either CPU or GPU.
Install Python packages
Install the libraries required for running the EmbeddingGemma model and generating embeddings. Sentence Transformers is a Python framework for text and image embeddings. For more information, see the Sentence Transformers documentation.
pip install -U sentence-transformers git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview
After you've accepted the license, you need a valid Hugging Face Token to access the model.
# Log in to the Hugging Face Hub
from huggingface_hub import login
login()
Load model
Use the sentence-transformers library to create an instance of a model class with EmbeddingGemma.
import torch
from sentence_transformers import SentenceTransformer
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "google/embeddinggemma-300M"
model = SentenceTransformer(model_id).to(device=device)
print(f"Device: {model.device}")
print(model)
print("Total number of parameters in the model:", sum([p.numel() for _, p in model.named_parameters()]))
Device: cuda:0
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
Total number of parameters in the model: 307581696
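As a quick sanity check (a minimal sketch, not part of the original notebook), you can encode a single sentence and confirm that the embedding dimensionality matches the 768 shown in the module summary above:
# Encode one sentence and inspect the embedding shape (should be 768).
sample_embedding = model.encode("Which account should I use for tax-free investing?")
print(sample_embedding.shape)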
Prepare the fine-tuning dataset
This is the most important part. You need to create a dataset that teaches the model what "similar" means in your specific context. This data is often structured as triplets: (anchor, positive, negative)
- Anchor: The original query or sentence.
- Positive: A sentence that is semantically very similar or identical to the anchor.
- Negative: A sentence that is on a related topic but semantically distinct.
In this example, we only prepared 3 triplets, but for a real application, you would need a much larger dataset to perform well.
from datasets import Dataset
dataset = [
    ["How do I open a NISA account?", "What is the procedure for starting a new tax-free investment account?", "I want to check the balance of my regular savings account."],
    ["Are there fees for making an early repayment on a home loan?", "If I pay back my house loan early, will there be any costs?", "What is the management fee for this investment trust?"],
    ["What is the coverage for medical insurance?", "Tell me about the benefits of the health insurance plan.", "What is the cancellation policy for my life insurance?"],
]
# Convert the list-based dataset into a list of dictionaries.
data_as_dicts = [ {"anchor": row[0], "positive": row[1], "negative": row[2]} for row in dataset ]
# Create a Hugging Face `Dataset` object from the list of dictionaries.
train_dataset = Dataset.from_list(data_as_dicts)
print(train_dataset)
Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 3
})
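For a real knowledge base, you would typically load the triplets from a file instead of hard-coding them. A minimal sketch, assuming a hypothetical triplets.csv file with anchor, positive, and negative columns:
from datasets import load_dataset
# Hypothetical CSV export of your support team's triplets.
train_dataset = load_dataset("csv", data_files="triplets.csv", split="train")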
Before fine-tuning
A search for "tax-free investment" might give you the following results, with similarity scores:
- Document: Opening a NISA Account (Score: 0.51)
- Document: Opening a Regular Savings Account (Score: 0.50) <- Similar score, potentially confusing
- Document: Home Loan Application Guide (Score: 0.44)
task_name = "STS"
def get_scores(query, documents):
  # Calculate embeddings by calling model.encode()
  query_embeddings = model.encode(query, prompt_name=task_name)
  doc_embeddings = model.encode(documents, prompt_name=task_name)
  # Calculate the embedding similarities
  similarities = model.similarity(query_embeddings, doc_embeddings)
  for idx, doc in enumerate(documents):
    print("Document: ", doc, "-> 🤖 Score: ", similarities.numpy()[0][idx])
query = "I want to start a tax-free installment investment, what should I do?"
documents = ["Opening a NISA Account", "Opening a Regular Savings Account", "Home Loan Application Guide"]
get_scores(query, documents)
Document:  Opening a NISA Account -> 🤖 Score:  0.51571906
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.5035889
Document:  Home Loan Application Guide -> 🤖 Score:  0.4406476
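The "STS" value passed as prompt_name refers to one of the task prompts bundled with EmbeddingGemma. To see which prompts are available (a quick sketch; the exact names depend on the model release), you can print them:
# List the task prompts shipped with the model.
print(model.prompts)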
Training
With a framework like sentence-transformers for Python, the base model gradually learns the subtle differences in your financial vocabulary.
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from transformers import TrainerCallback
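# MultipleNegativesRankingLoss pulls each anchor toward its positive and pushes
# it away from the explicit negative (and any other in-batch examples).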
loss = MultipleNegativesRankingLoss(model)
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="my-embedding-gemma",
    # Optional training parameters:
    prompts=model.prompts[task_name],    # use model's prompt to train
    num_train_epochs=5,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    # Optional tracking/debugging parameters:
    logging_steps=train_dataset.num_rows,
    report_to="none",
)
class MyCallback(TrainerCallback):
    "A callback that evaluates the model at the end of eopch"
    def __init__(self, evaluate):
        self.evaluate = evaluate # evaluate function
    def on_log(self, args, state, control, **kwargs):
        # Evaluate the model by re-scoring the example documents
        print(f"Step {state.global_step} finished. Running evaluation:")
        self.evaluate()
def evaluate():
  get_scores(query, documents)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    callbacks=[MyCallback(evaluate)]
)
trainer.train()
Step 3 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.6459116
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.42690125
Document:  Home Loan Application Guide -> 🤖 Score:  0.40419024
Step 6 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.68530923
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.3611964
Document:  Home Loan Application Guide -> 🤖 Score:  0.40812016
Step 9 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.7168733
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.3449782
Document:  Home Loan Application Guide -> 🤖 Score:  0.44477722
Step 12 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73008573
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34124148
Document:  Home Loan Application Guide -> 🤖 Score:  0.4676212
Step 15 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
Step 15 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
TrainOutput(global_step=15, training_loss=0.009651267528511198, metrics={'train_runtime': 195.3004, 'train_samples_per_second': 0.077, 'train_steps_per_second': 0.077, 'total_flos': 0.0, 'train_loss': 0.009651267528511198, 'epoch': 5.0})
After fine-tuning
Now, the search results are much clearer:
- Document: Opening a NISA Account (Score: 0.73) <- More confident
- Document: Opening a Regular Savings Account (Score: 0.34) <- Clearly less relevant
- Document: Home Loan Application Guide (Score: 0.47)
get_scores(query, documents)
Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
To upload your model to the Hugging Face Hub, you can use the push_to_hub method from the Sentence Transformers library.
Uploading your model makes it easy to access it for inference directly from the Hub, share it with others, and version your work. Once uploaded, anyone can load your model with a single line of code by referencing its unique model ID <username>/my-embedding-gemma.
# Push to Hub
model.push_to_hub("my-embedding-gemma")
Summary and next steps
You have now learned how to adapt an EmbeddingGemma model for a specific domain by fine-tuning it with the Sentence Transformers library.
Explore what more you can do with EmbeddingGemma:
- Training Overview in the Sentence Transformers documentation
- Generating embeddings with Sentence Transformers
- Simple RAG example in the Gemma Cookbook