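Fine-tune Gemma for vision tasks using Hugging Face Transformers and QLoRA

This guide walks you through fine-tuning Gemma on a custom image and text dataset for a vision task (generating product descriptions) using Hugging Face Transformers and TRL. You will learn:

What Quantized Low-Rank Adaptation (QLoRA) is
How to set up the development environment
How to create and prepare the fine-tuning dataset for a vision task
How to fine-tune Gemma using TRL and the SFTTrainer
How to test model inference and generate product descriptions from images and text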

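What is Quantized Low-Rank Adaptation (QLoRA)

This guide demonstrates Quantized Low-Rank Adaptation (QLoRA), a popular method for fine-tuning LLMs efficiently because it reduces computational resource requirements while maintaining high performance. In QLoRA, the pretrained model is quantized to 4-bit and its weights are frozen. Trainable adapter layers (LoRA) are then attached, and only the adapter layers are trained. Afterwards, the adapter weights can be merged with the base model or kept as a separate adapter.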
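To make the "quantized to 4-bit" step more concrete, here is a minimal sketch of blockwise 4-bit quantization in plain Python. It is a simplification under stated assumptions: it uses evenly spaced integer levels with one absmax scale per block, not the NF4 code book that QLoRA actually uses, and the helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
# Simplified blockwise 4-bit quantization (uniform levels, NOT the NF4 code book).
def quantize_block(weights):
    """Map floats to signed integers in [-7, 7] using one absmax scale per block."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the 4-bit codes and the block scale."""
    return [c * scale for c in codes]


block = [0.12, -0.70, 0.33, 0.05, -0.21, 0.64, -0.02, 0.48]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)

# The round-trip error of each weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, scale, max_error)
```

The frozen base weights are stored in this compact form and dequantized on the fly for the forward pass, while the LoRA adapter weights stay in higher precision and receive the gradient updates.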

Set up the development environment

The first step is to install the Hugging Face libraries, including TRL and datasets, to fine-tune the open model.

# Install PyTorch & other libraries
%pip install "torch>=2.4.0" tensorboard torchvision

# Install a Transformers release with Gemma 3 support
%pip install "transformers>=4.51.3"

# Install Hugging Face libraries
%pip install --upgrade \
  "datasets==3.3.2" \
  "accelerate==1.4.0" \
  "evaluate==0.4.3" \
  "bitsandbytes==0.45.3" \
  "trl==0.15.2" \
  "peft==0.14.0" \
  "pillow==11.1.0" \
  protobuf \
  sentencepiece

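Before you can start training, you must accept the Gemma terms of use. You can accept the license on Hugging Face by clicking the "Agree and access repository" button on the model page at http://huggingface.co/google/gemma-3-4b-pt (or the appropriate model page for the vision-capable Gemma model you are using).

After accepting the license, you need a valid Hugging Face token to access the model. If you are running inside Google Colab, you can use your Hugging Face token securely through Colab secrets; otherwise, you can pass the token directly to the login method. Make sure your token also has write access, because you push the model to the Hub during training.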

from google.colab import userdata
from huggingface_hub import login

# Log in to the Hugging Face Hub
hf_token = userdata.get('HF_TOKEN') # Reads the Colab secret; only works inside Google Colab
login(hf_token)
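The `google.colab` import above only works inside Colab. As a rough sketch of the "otherwise" case, the hypothetical helper `resolve_hf_token` below tries Colab secrets first and falls back to an environment variable. The `HF_TOKEN` variable name mirrors the secret name used above; the fallback logic is an assumption, not part of the original guide.

```python
import os


def resolve_hf_token(name="HF_TOKEN"):
    """Return a token from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Google Colab
    except ImportError:
        return os.environ.get(name)
    return userdata.get(name)


hf_token = resolve_hf_token()
if hf_token is None:
    print("No token found; set the HF_TOKEN environment variable before logging in.")
```

The returned value can then be passed to `login(hf_token)` exactly as in the cell above.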

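Create and prepare the fine-tuning dataset

When fine-tuning LLMs, it is important to know your use case and the task you want to solve. This helps you create a dataset to fine-tune your model. If you have not defined your use case yet, you may want to go back to the drawing board.

As an example, this guide focuses on the following use case:

Fine-tune a Gemma model to generate concise, SEO-optimized product descriptions for an ecommerce platform, tailored specifically for mobile search.

This guide uses the philschmid/amazon-product-descriptions-vlm dataset, a dataset of Amazon product descriptions that includes product images and categories.

Hugging Face TRL supports multimodal conversations. The important part is the "image" content type, which tells the processing class that it should load the image. The structure should follow: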

{"messages": [{"role": "system", "content": [{"type": "text", "text":"You are..."}]}, {"role": "user", "content": [{"type": "text", "text": "..."}, {"type": "image"}]}, {"role": "assistant", "content": [{"type": "text", "text": "..."}]}]}
{"messages": [{"role": "system", "content": [{"type": "text", "text":"You are..."}]}, {"role": "user", "content": [{"type": "text", "text": "..."}, {"type": "image"}]}, {"role": "assistant", "content": [{"type": "text", "text": "..."}]}]}
{"messages": [{"role": "system", "content": [{"type": "text", "text":"You are..."}]}, {"role": "user", "content": [{"type": "text", "text": "..."}, {"type": "image"}]}, {"role": "assistant", "content": [{"type": "text", "text": "..."}]}]}
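Before training, it can help to check that every record follows this structure. The following is a small sketch of such a check in plain Python; the helper name `validate_conversation` and the exact rules it enforces are assumptions for illustration, not something TRL provides.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}
ALLOWED_TYPES = {"text", "image"}


def validate_conversation(sample):
    """Return a list of problems found in one {"messages": [...]} record."""
    problems = []
    for i, message in enumerate(sample.get("messages", [])):
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, list):
            problems.append(f"message {i}: content must be a list of parts")
            continue
        for part in content:
            if part.get("type") not in ALLOWED_TYPES:
                problems.append(f"message {i}: unknown part type {part.get('type')!r}")
            elif part["type"] == "text" and not isinstance(part.get("text"), str):
                problems.append(f"message {i}: text part needs a string 'text' field")
    return problems


good = {"messages": [
    {"role": "user", "content": [{"type": "text", "text": "..."}, {"type": "image"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "..."}]},
]}
bad = {"messages": [{"role": "image", "content": "a plain string"}]}

print(validate_conversation(good))
print(validate_conversation(bad))
```

Note that "image" appears as a content part type inside a user message, not as a role, which is why the `bad` record is flagged.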

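You can now use the Hugging Face Datasets library to load the dataset and create a prompt template that combines the image, product name, and category, and adds a system message. The dataset includes the images as PIL.Image objects.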

from datasets import load_dataset
from PIL import Image

# System message for the assistant
system_message = "You are an expert product description writer for Amazon."

# User prompt template that combines the product name and category
user_prompt = """Create a Short Product description based on the provided <PRODUCT> and <CATEGORY> and image.
Only return description. The description should be SEO optimized and for a better mobile search experience.

<PRODUCT>
{product}
</PRODUCT>

<CATEGORY>
{category}
</CATEGORY>
"""

# Convert dataset to OAI messages
def format_data(sample):
    return {
        "messages": [
            {
                "role": "system",
                "content": [{"type": "text", "text": system_message}],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": user_prompt.format(
                            product=sample["Product Name"],
                            category=sample["Category"],
                        ),
                    },
                    {
                        "type": "image",
                        "image": sample["image"],
                    },
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["description"]}],
            },
        ],
    }

def process_vision_info(messages: list[dict]) -> list[Image.Image]:
    image_inputs = []
    # Iterate through each conversation
    for msg in messages:
        # Get content (ensure it's a list)
        content = msg.get("content", [])
        if not isinstance(content, list):
            content = [content]

        # Check each content element for images
        for element in content:
            if isinstance(element, dict) and (
                "image" in element or element.get("type") == "image"
            ):
                # Get the image and convert to RGB. A bare {"type": "image"}
                # placeholder carries no image payload, so skip it.
                image = element.get("image")
                if image is not None:
                    image_inputs.append(image.convert("RGB"))
    return image_inputs

# Load dataset from the hub
dataset = load_dataset("philschmid/amazon-product-descriptions-vlm", split="train")

# Convert dataset to OAI messages
# Use a list comprehension to keep the PIL.Image type; .map would convert the images to bytes
dataset = [format_data(sample) for sample in dataset]

print(dataset[345]["messages"])
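Because `dataset` is now a plain Python list, it no longer has the Datasets library's splitting methods. If you want held-out samples for later testing, one option is a simple shuffled split, sketched below. The helper name `hold_out_split`, the 10% fraction, and the seed are assumptions; the guide itself trains on the full list.

```python
import random


def hold_out_split(samples, test_fraction=0.1, seed=42):
    """Shuffle a copy of the samples and return (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in records; in the guide these would be the formatted conversations.
records = [{"id": i} for i in range(20)]
train_records, test_records = hold_out_split(records)
print(len(train_records), len(test_records))
```

Using a seeded `random.Random` instance keeps the split reproducible across runs without touching the global random state.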

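Fine-tune Gemma using TRL and the SFTTrainer

You are now ready to fine-tune your model. Hugging Face TRL's SFTTrainer makes it straightforward to run supervised fine-tuning of open LLMs. The SFTTrainer is a subclass of the Trainer from the transformers library and supports all the same features, including logging, evaluation, and checkpointing, but adds additional quality-of-life features, including:

Dataset formatting, including conversational and instruction formats
Training on completions only, ignoring prompts
Packing datasets for more efficient training
Parameter-efficient fine-tuning (PEFT) support, including QLoRA
Preparing the model and tokenizer for conversational fine-tuning (such as adding special tokens)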
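To illustrate what "packing" means in the list above, here is a minimal sketch in plain Python: short token sequences are concatenated, separated by an end-of-sequence token, and cut into fixed-size blocks so that little compute is spent on padding. This shows the general idea only and is not TRL's implementation; `pack_sequences`, the token ids, and the block size are hypothetical.

```python
def pack_sequences(sequences, block_size, eos_id):
    """Concatenate token lists (EOS-separated) and cut them into fixed-size blocks."""
    stream = []
    for tokens in sequences:
        stream.extend(tokens)
        stream.append(eos_id)
    # Drop the incomplete tail so every block has exactly block_size tokens.
    usable = len(stream) - len(stream) % block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


sequences = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(sequences, block_size=4, eos_id=1)
print(blocks)  # [[5, 6, 7, 1], [8, 9, 1, 10], [11, 12, 13, 1]]
```

Packing is not used in this guide, because the images are processed by a custom collator rather than by the trainer's built-in dataset preparation.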

The following code loads the Gemma model and processor from Hugging Face and initializes the quantization configuration.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

# Hugging Face model id
model_id = "google/gemma-3-4b-pt" # or `google/gemma-3-12b-pt`, `google/gemma-3-27b-pt`

# Check if GPU benefits from bfloat16
if torch.cuda.get_device_capability()[0] < 8:
    raise ValueError("GPU does not support bfloat16, please use a GPU that supports bfloat16.")

# Define model init arguments
model_kwargs = dict(
    attn_implementation="eager", # Use "flash_attention_2" when running on Ampere or newer GPU
    torch_dtype=torch.bfloat16, # What torch dtype to use, defaults to auto
    device_map="auto", # Let torch decide how to load the model
)

# BitsAndBytesConfig int-4 config
model_kwargs["quantization_config"] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=model_kwargs["torch_dtype"],
    bnb_4bit_quant_storage=model_kwargs["torch_dtype"],
)

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(model_id, **model_kwargs)
processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
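As a back-of-the-envelope check on why 4-bit loading matters, the sketch below compares the memory needed for the weights alone in bfloat16 and in 4-bit. The parameter count is an assumed round figure for a "4B" model, and the 0.127 bits per parameter of overhead is the value the QLoRA paper reports for double quantization. The estimate ignores layers that stay unquantized, activations, gradients, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 4e9  # assumption: a round figure for a "4B" model

bf16_gb = weight_memory_gb(num_params, 16)
# 4-bit weights plus roughly 0.127 bits/param for the quantization constants
four_bit_gb = weight_memory_gb(num_params, 4 + 0.127)

print(f"bf16: {bf16_gb:.2f} GB, 4-bit: {four_bit_gb:.2f} GB")
```

Under these assumptions the weights shrink to roughly a quarter of their bfloat16 size.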

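The SFTTrainer has built-in integration with peft, which makes it straightforward to tune LLMs efficiently with QLoRA. You only need to create a LoraConfig and provide it to the trainer.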

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
    modules_to_save=[
        "lm_head",
        "embed_tokens",
    ],
)
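To get a feel for what `r=16` and `lora_alpha=16` imply, the sketch below counts the adapter parameters for a single linear layer and computes the standard LoRA scaling factor `lora_alpha / r`. The layer dimensions are hypothetical and chosen only for illustration; they are not read from the Gemma configuration.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r): r * (d_in + d_out) values."""
    return r * (d_in + d_out)


d_in, d_out = 2560, 2560   # hypothetical square projection layer
r, lora_alpha = 16, 16     # values from the LoraConfig above

full_params = d_in * d_out
adapter_params = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r   # factor applied to the low-rank update

print(adapter_params, full_params, f"{adapter_params / full_params:.2%}", scaling)
```

For this assumed layer the adapter holds about 1.25% as many values as the frozen weight matrix. The modules listed in `modules_to_save` are a different matter: they are trained and saved in full, so they add considerably more trainable parameters than the adapters do.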

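Before you can start training, you need to define the hyperparameters to use in an SFTConfig and a custom collate_fn to handle the vision processing. The collate_fn converts the messages containing text and images into a format the model can understand.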

from trl import SFTConfig

args = SFTConfig(
    output_dir="gemma-product-description",     # directory to save and repository id
    num_train_epochs=1,                         # number of training epochs
    per_device_train_batch_size=1,              # batch size per device during training
    gradient_accumulation_steps=4,              # number of steps before performing a backward/update pass
    gradient_checkpointing=True,                # use gradient checkpointing to save memory
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    logging_steps=5,                            # log every 5 steps
    save_strategy="epoch",                      # save checkpoint every epoch
    learning_rate=2e-4,                         # learning rate, based on QLoRA paper
    bf16=True,                                  # use bfloat16 precision
    max_grad_norm=0.3,                          # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                          # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",               # use constant learning rate scheduler
    push_to_hub=True,                           # push model to hub
    report_to="tensorboard",                    # report metrics to tensorboard
    gradient_checkpointing_kwargs={
        "use_reentrant": False
    },  # use non-reentrant checkpointing
    dataset_text_field="",                      # need a dummy field for collator
    dataset_kwargs={"skip_prepare_dataset": True},  # important for collator
)
args.remove_unused_columns = False # important for collator

# Create a data collator to encode text and image pairs
def collate_fn(examples):
    texts = []
    images = []
    for example in examples:
        image_inputs = process_vision_info(example["messages"])
        text = processor.apply_chat_template(
            example["messages"], add_generation_prompt=False, tokenize=False
        )
        texts.append(text.strip())
        images.append(image_inputs)

    # Tokenize the texts and process the images
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # The labels are the input_ids, and we mask the padding tokens and image tokens in the loss computation
    labels = batch["input_ids"].clone()

    # Mask image tokens
    image_token_id = [
        processor.tokenizer.convert_tokens_to_ids(
            processor.tokenizer.special_tokens_map["boi_token"]
        )
    ]
    # Mask tokens that should not contribute to the loss
    labels[labels == processor.tokenizer.pad_token_id] = -100
    labels[labels == image_token_id] = -100
    labels[labels == 262144] = -100  # hard-coded id, assumed to be the Gemma 3 image soft token

    batch["labels"] = labels
    return batch
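The masking step in the collator can be hard to picture on tensors, so here is the same idea on plain Python lists. The token ids are made up for illustration, and `mask_labels` is a hypothetical helper; the value -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_labels(input_ids, masked_ids):
    """Copy input_ids into labels, replacing every masked token id with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token in masked_ids else token for token in row]
        for row in input_ids
    ]


PAD_ID, IMAGE_ID = 0, 99  # hypothetical ids for illustration only
batch_input_ids = [
    [2, 99, 99, 17, 18, 1],  # sequence with two image tokens
    [2, 17, 1, 0, 0, 0],     # shorter sequence, right-padded
]
labels = mask_labels(batch_input_ids, {PAD_ID, IMAGE_ID})
print(labels)  # [[2, -100, -100, 17, 18, 1], [2, 17, 1, -100, -100, -100]]
```

As in the collator above, only padding and image tokens are masked here, so the loss is still computed on the system and user text as well as on the assistant's description.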

You now have every building block you need to create your SFTTrainer and start training your model.

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=processor,
    data_collator=collate_fn,
)

Start training by calling the train() method.

# Start training, the model will be automatically saved to the Hub and the output directory
trainer.train()

# Save the final model again to the Hugging Face Hub
trainer.save_model()
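To estimate how long a run will take, it helps to work out how many optimizer steps the configuration implies. The sketch below does that arithmetic for the batch settings used in this guide. The sample count is hypothetical (use `len(dataset)` in practice), and the trainer's own step count may differ slightly depending on how it rounds the last partial batch.

```python
import math


def estimate_steps(num_samples, per_device_batch, grad_accum, epochs=1, devices=1):
    """Return (effective batch size, approximate number of optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * devices
    steps = math.ceil(num_samples / effective_batch) * epochs
    return effective_batch, steps


num_samples = 1000  # hypothetical; use len(dataset) in practice
effective_batch, steps = estimate_steps(num_samples, per_device_batch=1, grad_accum=4)
log_events = steps // 5  # with logging_steps=5

print(effective_batch, steps, log_events)  # 4 250 50
```

With a per-device batch size of 1 and 4 accumulation steps, each optimizer update sees 4 samples.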

Before you can test your model, make sure to free the memory.

# free the memory again
del model
del trainer
torch.cuda.empty_cache()

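When using QLoRA, you only train adapters and not the full model. This means that when saving the model during training, you only save the adapter weights and not the full model. If you want to save the full model, which makes it easier to use with serving stacks like vLLM or TGI, you can merge the adapter weights into the model weights using the merge_and_unload method and then save the model with the save_pretrained method. This saves a standalone model that can be used for inference.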

from peft import PeftModel

# Load the base model
model = AutoModelForImageTextToText.from_pretrained(model_id, low_cpu_mem_usage=True)

# Merge LoRA and base model and save
peft_model = PeftModel.from_pretrained(model, args.output_dir)
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("merged_model", safe_serialization=True, max_shard_size="2GB")

processor = AutoProcessor.from_pretrained(args.output_dir)
processor.save_pretrained("merged_model")
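Conceptually, merging folds the low-rank update into the frozen weight matrix: W' = W + (alpha / r) * B A. The toy example below checks, on tiny hand-written matrices in plain Python, that the merged weight gives the same output as running the base layer and the adapter side by side. It is a sketch of the underlying arithmetic, not peft's implementation, and all values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (2 x 3)
A = [[0.5, -0.5, 1.0]]                    # LoRA A (r x 3) with r = 1
B = [[2.0], [-1.0]]                       # LoRA B (2 x r)
x = [[1.0], [2.0], [3.0]]                 # input column vector
alpha, r = 2, 1

merged_out = matmul(merge_lora(W, A, B, alpha, r), x)
base_out = matmul(W, x)
low_rank_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, low_rank_out)]

print(merged_out, adapter_out)  # [[17.0], [-6.0]] [[17.0], [-6.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged checkpoint can be served like any regular model.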

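Test model inference and generate product descriptions

After training is done, you will want to evaluate and test your model. You can load different samples from the test dataset and evaluate the model on those samples.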

import torch

# Load Model with PEFT adapter
model = AutoModelForImageTextToText.from_pretrained(
  args.output_dir,
  device_map="auto",
  torch_dtype=torch.bfloat16,
  attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained(args.output_dir)

You can test inference by providing a product name, category, and image. The sample includes a Marvel action figure.

import requests
from PIL import Image

# Test sample with Product Name, Category and Image
sample = {
  "product_name": "Hasbro Marvel Avengers-Serie Marvel Assemble Titan-Held, Iron Man, 30,5 cm Actionfigur",
  "category": "Toys & Games | Toy Figures & Playsets | Action Figures",
  "image": Image.open(requests.get("https://m.media-amazon.com/images/I/81+7Up7IWyL._AC_SY300_SX300_.jpg", stream=True).raw).convert("RGB")
}

def generate_description(sample, model, processor):
    # Convert sample into messages and then apply the chat template
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_message}]},
        {"role": "user", "content": [
            {"type": "image","image": sample["image"]},
            {"type": "text", "text": user_prompt.format(product=sample["product_name"], category=sample["category"])},
        ]},
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Process the image and text
    image_inputs = process_vision_info(messages)
    # Tokenize the text and process the images
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    )
    # Move the inputs to the device
    inputs = inputs.to(model.device)

    # Generate the output
    stop_token_ids = [processor.tokenizer.eos_token_id, processor.tokenizer.convert_tokens_to_ids("<end_of_turn>")]
    generated_ids = model.generate(**inputs, max_new_tokens=256, top_p=1.0, do_sample=True, temperature=0.8, eos_token_id=stop_token_ids, disable_compile=True)
    # Trim the generation and decode the output to text
    generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0]

# generate the description
description = generate_description(sample, model, processor)
print(description)
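The `temperature` and `top_p` arguments passed to `generate` control how the next token is sampled. The sketch below shows the general idea in plain Python: logits are divided by the temperature, turned into probabilities with a softmax, and then restricted to the smallest set of tokens whose cumulative probability reaches `top_p`. It illustrates the technique only and is not the transformers implementation; `next_token_distribution` and the logits are hypothetical.

```python
import math


def next_token_distribution(logits, temperature=0.8, top_p=1.0):
    """Temperature-scaled softmax followed by nucleus (top-p) filtering."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens; everything else gets probability 0.
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0]
print(next_token_distribution(logits, temperature=0.8, top_p=1.0))
print(next_token_distribution(logits, temperature=1.0, top_p=0.5))
```

With `top_p=1.0`, as in the generation call above, no tokens are filtered out, so only the temperature shapes the distribution; values below 1.0 make it sharper.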

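Summary and next steps

This tutorial covered how to fine-tune a Gemma model for vision tasks using TRL and QLoRA, specifically for generating product descriptions. Check out the following docs next:

Learn how to generate text with a Gemma model.
Learn how to fine-tune Gemma for text tasks using Hugging Face Transformers.
Learn how to perform distributed fine-tuning and inference on a Gemma model.
Learn how to use Gemma open models with Vertex AI.
Learn how to fine-tune Gemma using KerasNLP and deploy to Vertex AI.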