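Fine-tune EmbeddingGemma

Fine-tuning helps close the gap between a model's general-purpose understanding and the specialized, high-performance accuracy that your application requires. Since no single model is perfect for every task, fine-tuning adapts the model to your specific domain.

Imagine that your company, "Shibuya Financial", offers a range of complex financial products such as investment trusts, NISA accounts (tax-advantaged savings accounts), and home loans. Your customer support team uses an internal knowledge base to quickly find answers to customer questions.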

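Setup

Before starting this tutorial, complete the following steps:

Get access to EmbeddingGemma by logging in to Hugging Face and selecting Acknowledge license for a Gemma model.
Generate a Hugging Face access token and use it to log in from Colab.

This notebook runs on either CPU or GPU.

Install Python packages

Install the libraries required to run the EmbeddingGemma model and generate embeddings. Sentence Transformers is a Python framework for text and image embeddings. For more information, see the Sentence Transformers documentation.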

pip install -U sentence-transformers git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview

After you have accepted the license, you need a valid Hugging Face token to access the model.

# Log in to the Hugging Face Hub
from huggingface_hub import login
login()

Load the model

Use the sentence-transformers library to create an instance of a model class with EmbeddingGemma.

import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "google/embeddinggemma-300M"
model = SentenceTransformer(model_id).to(device=device)

print(f"Device: {model.device}")
print(model)
print("Total number of parameters in the model:", sum([p.numel() for _, p in model.named_parameters()]))

Device: cuda:0
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
Total number of parameters in the model: 307581696
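The printed module list can be read as a pipeline: token embeddings are mean-pooled into a single 768-dimensional vector, passed through two bias-free linear layers, and then L2-normalized. The following is a toy sketch of the pooling and normalization steps only, written in plain Python with made-up 3-dimensional token vectors. The helper names `mean_pool` and `l2_normalize` are illustrative and are not part of the Sentence Transformers API, and the sketch ignores padding masks, which the real pooling layer has to handle.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Hypothetical token embeddings for a 2-token input (the real ones have 768 dims).
tokens = [[1.0, 2.0, 2.0], [3.0, 2.0, 2.0]]
pooled = mean_pool(tokens)
embedding = l2_normalize(pooled)
print(pooled)
print(embedding)
```

Because the final vectors have unit length, the dot product of two embeddings equals their cosine similarity.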

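Prepare the fine-tuning dataset

This is the most crucial part. You need to create a dataset that teaches the model what "similar" means in your specific context. This data is often structured as triplets: (anchor, positive, negative).

Anchor: The original query or sentence.
Positive: A sentence that is semantically very similar or identical to the anchor.
Negative: A sentence that is on a related topic but is semantically distinct.

This example uses only 3 triplets, but a real application needs a much larger dataset to perform well.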

from datasets import Dataset

dataset = [
    ["How do I open a NISA account?", "What is the procedure for starting a new tax-free investment account?", "I want to check the balance of my regular savings account."],
    ["Are there fees for making an early repayment on a home loan?", "If I pay back my house loan early, will there be any costs?", "What is the management fee for this investment trust?"],
    ["What is the coverage for medical insurance?", "Tell me about the benefits of the health insurance plan.", "What is the cancellation policy for my life insurance?"],
]

# Convert the list-based dataset into a list of dictionaries.
data_as_dicts = [ {"anchor": row[0], "positive": row[1], "negative": row[2]} for row in dataset ]

# Create a Hugging Face `Dataset` object from the list of dictionaries.
train_dataset = Dataset.from_list(data_as_dicts)
print(train_dataset)

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 3
})
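Before training on a larger, hand-collected dataset, it can help to check that every row really is a well-formed triplet. The sketch below is one possible sanity check, not part of the original notebook; `validate_triplets` is a hypothetical helper, and the second row is deliberately broken to show what a reported problem looks like.

```python
def validate_triplets(rows):
    """Return a list of problems found in (anchor, positive, negative) rows."""
    problems = []
    for i, row in enumerate(rows):
        if len(row) != 3:
            problems.append(f"row {i}: expected 3 items, got {len(row)}")
            continue
        anchor, positive, negative = (text.strip() for text in row)
        if not (anchor and positive and negative):
            problems.append(f"row {i}: empty text")
        if negative in (anchor, positive):
            problems.append(f"row {i}: negative duplicates anchor or positive")
    return problems

rows = [
    ["How do I open a NISA account?",
     "What is the procedure for starting a new tax-free investment account?",
     "I want to check the balance of my regular savings account."],
    # Deliberately broken: the negative repeats the positive.
    ["What is the coverage for medical insurance?",
     "Tell me about the benefits of the health insurance plan.",
     "Tell me about the benefits of the health insurance plan."],
]
print(validate_triplets(rows))
```

A negative that duplicates the positive gives the loss contradictory targets for the same text, so such rows are worth catching before training.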

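Before fine-tuning

A search for "tax-free investment" might return the following results, with similarity scores:

Document: Opening a NISA Account (Score: 0.51)
Document: Opening a Regular Savings Account (Score: 0.50) <- Similar score, potentially confusing
Document: Home Loan Application Guide (Score: 0.44)

Note: To generate optimal embeddings with EmbeddingGemma, add an "instruction prompt" or "task" to the beginning of your input text. This tutorial uses STS for sentence similarity. For details on all available EmbeddingGemma prompts, see the model card.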
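Conceptually, passing `prompt_name` to `encode()` prepends a fixed prompt string to each input before it is tokenized. The sketch below illustrates that idea in plain Python. The prompt strings in `ASSUMED_PROMPTS` are assumptions written from memory of the model card, and `with_prompt` is a hypothetical helper; check the loaded model's `model.prompts` dictionary for the authoritative values.

```python
# Assumed prompt strings; verify them against model.prompts before relying on them.
ASSUMED_PROMPTS = {
    "STS": "task: sentence similarity | query: ",
    "query": "task: search result | query: ",
}

def with_prompt(text, prompt_name, prompts=ASSUMED_PROMPTS):
    """Prepend the named task prompt to the input text."""
    return prompts[prompt_name] + text

print(with_prompt("Opening a NISA Account", "STS"))
```

Using the same prompt for fine-tuning and for inference keeps the inputs the model sees consistent across both phases.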

task_name = "STS"

def get_scores(query, documents):
  # Calculate embeddings by calling model.encode()
  query_embeddings = model.encode(query, prompt_name=task_name)
  doc_embeddings = model.encode(documents, prompt_name=task_name)

  # Calculate the embedding similarities
  similarities = model.similarity(query_embeddings, doc_embeddings)

  for idx, doc in enumerate(documents):
    print("Document: ", doc, "-> 🤖 Score: ", similarities.numpy()[0][idx])

query = "I want to start a tax-free installment investment, what should I do?"
documents = ["Opening a NISA Account", "Opening a Regular Savings Account", "Home Loan Application Guide"]

get_scores(query, documents)

Document:  Opening a NISA Account -> 🤖 Score:  0.51571906
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.5035889
Document:  Home Loan Application Guide -> 🤖 Score:  0.4406476
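The scores above come from `model.similarity`, which defaults to cosine similarity unless the model configuration specifies another function. As a rough illustration of what that number means, here is a plain-Python cosine similarity on toy 2-dimensional vectors. The vectors are invented for the example and `cosine_similarity` is an illustrative helper, not the library's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]
doc_vecs = {
    "same direction": [2.0, 0.0],
    "45 degrees apart": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}
for name, vec in doc_vecs.items():
    print(name, round(cosine_similarity(query_vec, vec), 4))
```

Only the direction of the vectors matters, not their length, which is why the first document scores 1.0 even though it is twice as long as the query vector.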

Training

With a framework like sentence-transformers in Python, the base model gradually learns the subtle distinctions in your financial vocabulary.
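The training code in this section uses `MultipleNegativesRankingLoss`. As a rough intuition, this loss treats the positive as the correct answer among a set of candidates and applies cross-entropy to scaled cosine similarities. The sketch below shows that idea for a single anchor in plain Python. It is a simplified illustration rather than the library's implementation; the toy vectors are made up, and `scale=20.0` reflects the library default as far as I know.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ranking_loss_single(anchor, positive, negatives, scale=20.0):
    """Cross-entropy over scaled cosine scores, with the positive as the correct class."""
    scores = [scale * cosine(anchor, positive)]
    scores += [scale * cosine(anchor, neg) for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - scores[0]

anchor = [1.0, 0.0]
positive = [0.9, 0.1]   # points almost the same way as the anchor
negative = [0.1, 0.9]   # points almost orthogonally to the anchor

good = ranking_loss_single(anchor, positive, [negative])
bad = ranking_loss_single(anchor, negative, [positive])  # roles swapped
print(good, bad)
```

The loss is near zero when the positive already outranks the negative and large when the ranking is reversed. In a real batch, the positives and negatives of the other rows also serve as extra negatives; with `per_device_train_batch_size=1`, each step only sees the row's own positive and negative.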

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from transformers import TrainerCallback

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="my-embedding-gemma",
    # Optional training parameters:
    prompts=model.prompts[task_name],    # use model's prompt to train
    num_train_epochs=5,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    # Optional tracking/debugging parameters:
    logging_steps=train_dataset.num_rows,
    report_to="none",
)

class MyCallback(TrainerCallback):
    "A callback that runs an evaluation function each time the trainer logs"
    def __init__(self, evaluate):
        self.evaluate = evaluate # evaluate function

    def on_log(self, args, state, control, **kwargs):
        # Print similarity scores for the sample query at the current step
        print(f"Step {state.global_step} finished. Running evaluation:")
        self.evaluate()

def evaluate():
  get_scores(query, documents)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    callbacks=[MyCallback(evaluate)]
)
trainer.train()

Step 3 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.6459116
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.42690125
Document:  Home Loan Application Guide -> 🤖 Score:  0.40419024
Step 6 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.68530923
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.3611964
Document:  Home Loan Application Guide -> 🤖 Score:  0.40812016
Step 9 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.7168733
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.3449782
Document:  Home Loan Application Guide -> 🤖 Score:  0.44477722
Step 12 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73008573
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34124148
Document:  Home Loan Application Guide -> 🤖 Score:  0.4676212
Step 15 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
Step 15 finished. Running evaluation:
Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
TrainOutput(global_step=15, training_loss=0.009651267528511198, metrics={'train_runtime': 195.3004, 'train_samples_per_second': 0.077, 'train_steps_per_second': 0.077, 'total_flos': 0.0, 'train_loss': 0.009651267528511198, 'epoch': 5.0})
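The step numbers in this log follow directly from the training arguments. The arithmetic below is a small sketch that assumes a single device and no gradient accumulation; the variable names simply mirror the arguments used in this tutorial.

```python
import math

num_rows = 3        # train_dataset.num_rows
batch_size = 1      # per_device_train_batch_size
epochs = 5          # num_train_epochs
logging_steps = 3   # logging_steps=train_dataset.num_rows

steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * epochs
log_points = list(range(logging_steps, total_steps + 1, logging_steps))
print(total_steps, log_points)
```

With these settings, the callback fires once per epoch, which matches the `global_step=15` reported in `TrainOutput`. The second "Step 15" block most likely comes from the trainer logging its final summary metrics, which triggers `on_log` one more time.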

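After fine-tuning

The same search now yields much clearer results:

Document: Opening a NISA Account (Score: 0.73) <- Much more confident
Document: Opening a Regular Savings Account (Score: 0.34) <- Clearly less relevant
Document: Home Loan Application Guide (Score: 0.47)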

get_scores(query, documents)

Document:  Opening a NISA Account -> 🤖 Score:  0.73378766
Document:  Opening a Regular Savings Account -> 🤖 Score:  0.34055778
Document:  Home Loan Application Guide -> 🤖 Score:  0.47503752
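One way to quantify "clearer" is the gap between the best and second-best score. The sketch below computes that gap from the scores printed in this tutorial before and after fine-tuning; `top_margin` is a hypothetical helper introduced for illustration.

```python
before = {
    "Opening a NISA Account": 0.51571906,
    "Opening a Regular Savings Account": 0.5035889,
    "Home Loan Application Guide": 0.4406476,
}
after = {
    "Opening a NISA Account": 0.73378766,
    "Opening a Regular Savings Account": 0.34055778,
    "Home Loan Application Guide": 0.47503752,
}

def top_margin(scores):
    """Gap between the highest and the second-highest score."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]

print(round(top_margin(before), 4))
print(round(top_margin(after), 4))
```

The margin grows from about 0.012 to about 0.259. Note that the runner-up changes as well: after fine-tuning, the second-highest score belongs to the home loan guide rather than the regular savings account.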

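To upload your model to the Hugging Face Hub, use the push_to_hub method from the Sentence Transformers library.

Uploading your model makes it easy to access for inference directly from the Hub, to share with others, and to version your work. Once it is uploaded, the model can be loaded with a single line of code by referencing its unique model ID, <username>/my-embedding-gemma.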

# Push to Hub
model.push_to_hub("my-embedding-gemma")
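Loading the uploaded model back might look like the following sketch. It assumes the push succeeded and that the repository lives under your own account. `hub_model_id` and `load_finetuned` are hypothetical helpers, and "your-username" is a placeholder, not a real account.

```python
def hub_model_id(username, model_name="my-embedding-gemma"):
    """Build the Hub repository ID in the <username>/<model_name> form."""
    return f"{username}/{model_name}"

def load_finetuned(username):
    """Load the fine-tuned model from the Hub (requires network access and a valid token)."""
    # Imported lazily so the helper above can be used without the library installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(hub_model_id(username))

print(hub_model_id("your-username"))
```

Calling `load_finetuned("your-username")` downloads the weights on first use, so it only works after the model has actually been pushed.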

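Summary and next steps

You have now learned how to adapt an EmbeddingGemma model to a specific domain by fine-tuning it with the Sentence Transformers library.

Explore what more you can do with EmbeddingGemma:

Training Overview in the Sentence Transformers documentation
Generate embeddings with Sentence Transformers
Simple RAG example in the Gemma Cookbook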