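Fine-tune EmbeddingGemma

Fine-tuning helps close the gap between a model's general-purpose understanding and the specialized, high-performance accuracy that your application requires. No single model is ideal for every task, so fine-tuning adapts the model to your specific domain.

For example, suppose your company, "Shibuya Financial", offers complex financial products such as investment trusts, NISA accounts (tax-advantaged savings accounts), and home loans. Your customer support team uses an internal knowledge base to quickly find answers to customer questions.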

Setup

Before starting this tutorial, complete the following steps:

Get access to EmbeddingGemma by logging in to Hugging Face and selecting [Acknowledge license] for a Gemma model.
Generate a Hugging Face access token and use it to log in from Colab.

This notebook runs on either a CPU or a GPU.

Install Python packages

Install the libraries required to run the EmbeddingGemma model and generate embeddings. Sentence Transformers is a Python framework for text and image embeddings. For more information, see the Sentence Transformers documentation.

pip install -U sentence-transformers git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview

After you have accepted the license, you need a valid Hugging Face token to access the model.

# Log in to the Hugging Face Hub
from huggingface_hub import login
login()

Load the model

Use the sentence-transformers library to create an instance of a model class with EmbeddingGemma.

import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "google/embeddinggemma-300M"
model = SentenceTransformer(model_id).to(device=device)

print(f"Device: {model.device}")
print(model)
print("Total number of parameters in the model:", sum([p.numel() for _, p in model.named_parameters()]))

Device: cuda:0
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
Total number of parameters in the model: 307581696
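
The printed module list shows mean pooling over token vectors followed, at the end, by a Normalize step. As a rough illustration of those two steps only, here is a minimal pure-Python sketch. It assumes padding is excluded via an attention mask, skips the two Dense layers entirely, and uses hypothetical 2-dimensional vectors in place of the 768-dimensional ones. It is not the library's implementation.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Hypothetical 2-dimensional token vectors; the real model uses 768 dimensions.
tokens = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]  # the last position stands in for padding

pooled = mean_pool(tokens, mask)      # average of the first two vectors
embedding = l2_normalize(pooled)      # unit-length version of the pooled vector
print(pooled, embedding)
```

Because the final vector has unit length, comparing two such embeddings by dot product gives the same ranking as comparing them by cosine similarity.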

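Prepare the fine-tuning dataset

This is the most crucial part. You need to create a dataset that teaches the model what "similar" means in your specific context. This data is often structured as triplets: (anchor, positive, negative).

Anchor: The original query or sentence.
Positive: A sentence that is semantically very similar or identical to the anchor.
Negative: A sentence on a related topic that is semantically distinct.

This example uses only 3 triplets, but a real application needs a much larger dataset to perform well.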

from datasets import Dataset

dataset = [
    ["How do I open a NISA account?", "What is the procedure for starting a new tax-free investment account?", "I want to check the balance of my regular savings account."],
    ["Are there fees for making an early repayment on a home loan?", "If I pay back my house loan early, will there be any costs?", "What is the management fee for this investment trust?"],
    ["What is the coverage for medical insurance?", "Tell me about the benefits of the health insurance plan.", "What is the cancellation policy for my life insurance?"],
]

# Convert the list-based dataset into a list of dictionaries.
data_as_dicts = [ {"anchor": row[0], "positive": row[1], "negative": row[2]} for row in dataset ]

# Create a Hugging Face `Dataset` object from the list of dictionaries.
train_dataset = Dataset.from_list(data_as_dicts)
print(train_dataset)

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 3
})
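
As the dataset grows, malformed rows become easier to miss. The following is a minimal sketch of a validation step that could run before Dataset.from_list. The helper name to_triplet_dicts and its specific checks are assumptions made for illustration and are not part of the original notebook.

```python
def to_triplet_dicts(rows):
    """Convert [anchor, positive, negative] rows into dicts, rejecting malformed rows."""
    triplets = []
    for index, row in enumerate(rows):
        if len(row) != 3:
            raise ValueError(f"Row {index} has {len(row)} items, expected 3")
        anchor, positive, negative = (text.strip() for text in row)
        if not (anchor and positive and negative):
            raise ValueError(f"Row {index} contains an empty string")
        if negative in (anchor, positive):
            raise ValueError(f"Row {index} reuses the anchor or positive as the negative")
        triplets.append({"anchor": anchor, "positive": positive, "negative": negative})
    return triplets

rows = [
    ["How do I open a NISA account?",
     "What is the procedure for starting a new tax-free investment account?",
     "I want to check the balance of my regular savings account."],
]
checked = to_triplet_dicts(rows)
print(checked[0]["anchor"])
```

The returned list has the same shape as data_as_dicts, so it could be passed to Dataset.from_list in the same way.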

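Before fine-tuning

A search for "tax-free investment" might return the following results, with similarity scores:

Document: Opening a NISA Account (Score: 0.51)
Document: Opening a Regular Savings Account (Score: 0.50) <- Similar score, potentially confusing
Document: Home Loan Application Guide (Score: 0.44)

Note: To generate optimal embeddings with EmbeddingGemma, add an "instruction prompt" or "task" to the beginning of the input text. For sentence similarity, use STS. For details on all available EmbeddingGemma prompts, see the model card.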
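
To make the idea of a task prompt concrete, the sketch below prepends a prompt string to the input text by hand. The prompt string shown for STS is an assumption used for illustration; the authoritative strings are the ones stored in model.prompts and listed on the model card. The names ASSUMED_PROMPTS and apply_prompt are hypothetical.

```python
# Assumed prompt table. The exact string is an assumption for illustration;
# check it against `model.prompts` before relying on it.
ASSUMED_PROMPTS = {
    "STS": "task: sentence similarity | query: ",
}

def apply_prompt(text, prompt_name, prompts=ASSUMED_PROMPTS):
    """Return the text with the named task prompt prepended."""
    return prompts[prompt_name] + text

prompted = apply_prompt("Opening a NISA Account", "STS")
print(prompted)
```

In practice you do not need to do this by hand: passing prompt_name to model.encode() lets Sentence Transformers prepend the stored prompt for you.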

task_name = "STS"

def get_scores(query, documents):
  # Calculate embeddings by calling model.encode()
  query_embeddings = model.encode(query, prompt_name=task_name)
  doc_embeddings = model.encode(documents, prompt_name=task_name)

  # Calculate the embedding similarities
  similarities = model.similarity(query_embeddings, doc_embeddings)

  for idx, doc in enumerate(documents):
    print("Document: ", doc, "-> 🤖 Score: ", similarities.numpy()[0][idx])

query = "I want to start a tax-free installment investment, what should I do?"
documents = ["Opening a NISA Account", "Opening a Regular Savings Account", "Home Loan Application Guide"]

get_scores(query, documents)

Document:  Opening a NISA Account -> 🤖 Score:  0.51571906
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.5035889
Document:  Home Loan Application Guide -> 🤖 Score:  0.4406476
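
The scores above are similarity values between the query embedding and each document embedding. Assuming the model uses cosine similarity (the Sentence Transformers default), the computation can be sketched in pure Python as follows. The 2-dimensional vectors are toy values chosen to show the range of the score, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]
doc_vecs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
print(scores)
```

Only the direction of the vectors matters, not their length, which is why scores of 0.51 and 0.50 indicate that the model sees the two documents as almost equally related to the query.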

Training

With a framework such as sentence-transformers in Python, the base model gradually learns the subtle distinctions in your financial vocabulary.

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from transformers import TrainerCallback

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="my-embedding-gemma",
    # Optional training parameters:
    prompts=model.prompts[task_name],    # use model's prompt to train
    num_train_epochs=5,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    # Optional tracking/debugging parameters:
    logging_steps=train_dataset.num_rows,
    report_to="none",
)

class MyCallback(TrainerCallback):
    "A callback that runs an evaluation on every logging event"
    def __init__(self, evaluate):
        self.evaluate = evaluate # evaluate function

    def on_log(self, args, state, control, **kwargs):
        # Re-score the sample query against the sample documents
        print(f"Step {state.global_step} finished. Running evaluation:")
        self.evaluate()

def evaluate():
  get_scores(query, documents)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    callbacks=[MyCallback(evaluate)]
)
trainer.train()

Step 3 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.6459116
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.42690125
Document:  Home Loan Application Guide -> 🤖 Score:  0.40419024
Step 6 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.68530923
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.3611964
Document:  Home Loan Application Guide -> 🤖 Score:  0.40812016
Step 9 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.7168733
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.3449782
Document:  Home Loan Application Guide -> 🤖 Score:  0.44477722
Step 12 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73008573
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34124148
Document:  Home Loan Application Guide -> 🤖 Score:  0.4676212
Step 15 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
Step 15 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
TrainOutput(global_step=15, training_loss=0.009651267528511198, metrics={'train_runtime': 195.3004, 'train_samples_per_second': 0.077, 'train_steps_per_second': 0.077, 'total_flos': 0.0, 'train_loss': 0.009651267528511198, 'epoch': 5.0})
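
The training run above uses MultipleNegativesRankingLoss. The following is a simplified pure-Python sketch of the idea behind that loss, not the library's implementation. For each anchor, every positive and negative in the batch is a candidate, and the loss is the cross-entropy of picking the anchor's own positive. The scale of 20.0 and the use of cosine similarity are assumed to match the library defaults, and the 2-dimensional vectors are toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct candidate for anchors[i]."""
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, cand) for cand in candidates]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
good_positives = [[0.9, 0.1], [0.1, 0.9]]   # each positive points near its anchor
bad_positives = [[0.1, 0.9], [0.9, 0.1]]    # positives swapped between anchors
negatives = [[-1.0, 0.0], [0.0, -1.0]]

low = mnr_loss(anchors, good_positives, negatives)
high = mnr_loss(anchors, bad_positives, negatives)
print(low < high)
```

With per_device_train_batch_size=1, as in this tutorial, each step sees one anchor with just two candidates: its positive and its negative. Larger batches add the other rows' positives and negatives as extra candidates.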

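After fine-tuning

The same search now gives much clearer results:

Document: Opening a NISA Account (Score: 0.73) <- Much more confident
Document: Opening a Regular Savings Account (Score: 0.34) <- Clearly less relevant
Document: Home Loan Application Guide (Score: 0.47)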

get_scores(query, documents)

Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
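
One way to quantify "clearer" is the gap between the best score and the runner-up. The short sketch below computes that margin from the scores printed in this tutorial. The helper name top_margin is hypothetical.

```python
before = {
    "Opening a NISA Account": 0.51571906,
    "Opening a Regular Savings Account": 0.5035889,
    "Home Loan Application Guide": 0.4406476,
}
after = {
    "Opening a NISA Account": 0.73378766,
    "Opening a Regular Savings Account": 0.34055778,
    "Home Loan Application Guide": 0.47503752,
}

def top_margin(scores):
    """Difference between the highest score and the second highest."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]

print(round(top_margin(before), 3))  # 0.012
print(round(top_margin(after), 3))   # 0.259
```

The margin grows from about 0.012 to about 0.259. After fine-tuning the runner-up is the home loan guide rather than the savings account document.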

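To upload the model to the Hugging Face Hub, use the push_to_hub method from the Sentence Transformers library.

Uploading your model makes it easy to access for inference directly from the Hub, share it with others, and version your work. Once it is uploaded, anyone can load the model with a single line of code by referencing its unique model ID, <username>/my-embedding-gemma.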

# Push to Hub
model.push_to_hub("my-embedding-gemma")

Summary and next steps

You have now learned how to adapt an EmbeddingGemma model to a specific domain by fine-tuning it with the Sentence Transformers library.

