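Run Gemma with Hugging Face Transformers

With Gemma open models you can generate text, summarize content, and analyze content. This tutorial shows how to get started running Gemma with Hugging Face Transformers, using text and image input to generate text. The Transformers Python library provides an API for accessing pre-trained generative AI models, including Gemma. For more information, see the Transformers documentation.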

Install Python packages

Install the Hugging Face libraries required to run the Gemma model and make requests.

# Install PyTorch
%pip install torch

# Install Transformers
%pip install transformers
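
After installing, it can help to confirm that both packages are visible to the current Python environment. The following is a minimal sketch using only the standard library. The helper name installed_versions is hypothetical and is not part of Transformers or PyTorch.

```python
from importlib import metadata


def installed_versions(packages=("torch", "transformers")):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


print(installed_versions())
```

A value of None for either package suggests the install step did not run in the environment that your notebook kernel uses.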

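Generate text from text

Prompting a Gemma model with text to get a text response is the simplest way to use Gemma, and it works with nearly all Gemma variants. This section shows how to use the Hugging Face Transformers library to load and configure a Gemma model for text-to-text generation.

Load model

Use the torch and transformers libraries to create an instance of the model execution pipeline class with Gemma. When you use a model to generate output or follow instructions, choose an instruction-tuned (IT) model, which typically has it in the model ID string. With the pipeline object, you specify the Gemma variant to use and the type of task to perform, in particular "any-to-any" for multimodal generation, as shown in the following code example: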

from transformers import pipeline

MODEL_ID = "google/gemma-4-E2B-it"

pipe = pipeline(
    task="any-to-any",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto"
)

config.json: 0.00B [00:00, ?B/s]
model.safetensors:   0%|          | 0.00/10.2G [00:00<?, ?B/s]
Loading weights:   0%|          | 0/2011 [00:00<?, ?it/s]
generation_config.json:   0%|          | 0.00/208 [00:00<?, ?B/s]
processor_config.json: 0.00B [00:00, ?B/s]
chat_template.jinja: 0.00B [00:00, ?B/s]
tokenizer_config.json: 0.00B [00:00, ?B/s]
tokenizer.json:   0%|          | 0.00/32.2M [00:00<?, ?B/s]

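Gemma supports only a few task settings for generation. For more information about the available task settings, see the Hugging Face Pipelines task() documentation. For details on using the Pipeline class, see the Hugging Face Pipelines documentation.

Run text generation

Once the Gemma model is loaded and configured in a pipeline object, you can send prompts to it. The following example code shows a basic request using the text parameter: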

pipe(text="<|turn>user\nroses are red<turn|>\n<|turn>model\n")

Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
[{'input_text': '<|turn>user\nroses are red<turn|>\n<|turn>model\n',
  'generated_text': '<|turn>user\nroses are red<turn|>\n<|turn>model\nThat\'s a classic phrase, often used to highlight a contrast or a truth.\n\n**"Roses are red"** is a very popular, simple, and sweet arrangement.\n\nWhat would you like to do with this phrase? Are you looking for:\n\n1. **More rhymes or phrases?**\n2. **A continuation of a thought?**\n3. **Just appreciating the simplicity?**'}]
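
The output above echoes the raw prompt at the start of generated_text. The following sketch shows one way to build the same single-turn prompt string and to strip the echoed prompt from a result. Both helper names are hypothetical, and the turn markers are copied from the example request above, so treat them as an assumption for other model variants.

```python
def build_single_turn_prompt(user_text):
    """Wrap user text in the turn markers used by the raw prompt example."""
    # Assumption: marker strings are taken verbatim from the sample request.
    return f"<|turn>user\n{user_text}<turn|>\n<|turn>model\n"


def strip_echoed_prompt(result):
    """Return only the newly generated text from one pipeline result dict.

    `result` is expected to look like the dicts printed above, with
    'input_text' and 'generated_text' keys.
    """
    prompt = result["input_text"]
    generated = result["generated_text"]
    if isinstance(prompt, str) and generated.startswith(prompt):
        return generated[len(prompt):]
    # Chat-style inputs (lists of messages) are not echoed as a string prefix.
    return generated


prompt = build_single_turn_prompt("roses are red")
sample = {"input_text": prompt, "generated_text": prompt + "Violets are blue."}
print(strip_echoed_prompt(sample))
```

In practice you would pass the prompt as pipe(text=prompt) and apply strip_echoed_prompt to the first element of the returned list.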

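Use a prompt template

When generating content with more complex prompts, use a prompt template to structure your request. A prompt template lets you specify input from specific roles, such as user or model, and is the required format for managing multi-turn chat interactions with Gemma models. The following example code shows how to construct a prompt template for Gemma: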

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 512
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Roses are red..."}]
    },
]

pipe(messages, return_full_text=False, generate_kwargs=gen_kwargs)

[{'input_text': [{'role': 'system',
    'content': [{'type': 'text', 'text': 'You are a helpful assistant.'}]},
   {'role': 'user',
    'content': [{'type': 'text', 'text': 'Roses are red...'}]}],
  'generated_text': 'Roses are red,\nViolets are blue,\nHow lovely to see\nA beautiful view.'}]
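
Because the message list carries the whole conversation, a multi-turn exchange can be sketched by appending the model's reply and the next user message before calling the pipeline again. The helpers below are hypothetical and build plain Python data only. The "assistant" role for model turns is an assumption based on the role name used in the image example later in this tutorial.

```python
def text_message(role, text):
    """Build one chat message holding a single text part."""
    return {"role": role, "content": [{"type": "text", "text": text}]}


def continue_conversation(messages, model_reply, next_user_text):
    """Return a new message list extended with the reply and a follow-up.

    The input list is left unchanged.
    """
    return messages + [
        text_message("assistant", model_reply),
        text_message("user", next_user_text),
    ]


messages = [
    text_message("system", "You are a helpful assistant."),
    text_message("user", "Roses are red..."),
]
# In a live session the reply would come from the pipeline result's
# 'generated_text' field; a fixed string stands in for it here.
reply = "Roses are red,\nViolets are blue,\nHow lovely to see\nA beautiful view."
messages = continue_conversation(messages, reply, "Now write one about tulips.")
print([m["role"] for m in messages])
```

The extended list can then be passed to the pipeline in the same way as the original messages list.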

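Generate text from image data

Starting with Gemma 3, for model sizes 4B and above, you can use image data as part of your prompt. This section shows how to use the Transformers library to load and configure a Gemma model to generate text output from image data and text input.

Use a prompt template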

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 512
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://ai.google.dev/static/gemma/docs/images/thali-indian-plate.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "This image shows"},
        ],
    },
]

pipe(text=messages, return_full_text=False, generate_kwargs=gen_kwargs)

[{'input_text': [{'role': 'user',
    'content': [{'type': 'image',
      'url': 'https://ai.google.dev/static/gemma/docs/images/thali-indian-plate.jpg'},
     {'type': 'text', 'text': 'What is shown in this image?'}]},
   {'role': 'assistant',
    'content': [{'type': 'text', 'text': 'This image shows'}]}],
  'generated_text': " a platter of Indian food, likely a meal or an assortment of dishes.\n\nHere's a breakdown of what is visible:\n\n*   **Flatbread:** There is a large, golden-brown flatbread (possibly naan or roti) dominating the center of the platter.\n*   **Dips/Sides:** There are several small bowls containing various accompaniments:\n    *   A bowl of **yellow/mustard-colored dip** (perhaps a chutney or sauce).\n    *   A bowl of **white creamy dip** (like raita or yogurt sauce).\n    *   A portion of **white rice**.\n    *   Several bowls of **curries or sauces** in different colors:\n        *   An **orange/brown curry**.\n        *   A **deep yellow/orange sauce**.\n        *   A **green sauce** (likely a chutney).\n*   **Garnish/Side Item:** In the upper right corner, there appears to be some darker, textured items, possibly fried pieces or spices.\n*   **Platter:** The food is served on a metal platter.\n\nOverall, it looks like a traditional Indian meal setup featuring bread, rice, and various flavorful sauces/curries."}]

You can include multiple images in a prompt by adding additional "type": "image" entries to the content list.
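
As a rough sketch of a multi-image prompt, the hypothetical helper below builds a single user message with one image entry per URL followed by the question text. It follows the "type" and "url" keys used in the example above. The second URL is a placeholder, not a real asset.

```python
def image_prompt(image_urls, question):
    """Build a one-message chat prompt with several images and a question."""
    content = [{"type": "image", "url": url} for url in image_urls]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = image_prompt(
    [
        "https://ai.google.dev/static/gemma/docs/images/thali-indian-plate.jpg",
        "https://example.com/second-image.jpg",  # placeholder URL
    ],
    "What do these two images have in common?",
)
print(len(messages[0]["content"]))
```

The resulting list can be passed as pipe(text=messages, ...) in the same way as the single-image example.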

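Note: Do not use the <|image|>, <start_of_image>, or <image_soft_token> tokens in the text portion of your prompt template, because doing so creates redundant tokens and causes processing errors.

Generate text from audio data

With Gemma 4 and Gemma 3n, you can use audio data as part of your prompt. This section shows how to use the Transformers library to load and configure a Gemma model to generate text output from audio data and text input.

Use a prompt template

When generating content with audio, use a prompt template to structure your request. A prompt template lets you specify input from specific roles, such as user or model, and is the required format for managing multi-turn chat interactions with Gemma models. The following example code shows how to construct a prompt template for Gemma that includes audio data input: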

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained(MODEL_ID)
config.max_new_tokens = 512
gen_kwargs = dict(generation_config=config)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
        ]
    }
]

pipe(text=messages, return_full_text=False, generate_kwargs=gen_kwargs)

[{'input_text': [{'role': 'user',
    'content': [{'type': 'text',
      'text': 'Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.'},
     {'type': 'audio',
      'audio': 'https://ai.google.dev/gemma/docs/audio/roses-are.wav'}]}],
  'generated_text': 'Roses are red, violets are blue.'}]

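You can include multiple audio files in a prompt by adding additional "type": "audio" entries to the content list.

Note: Do not use the <|audio|> or <audio_soft_token> tokens in the text portion of your prompt template, because doing so creates redundant tokens and causes processing errors.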
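
One way to guard against this mistake is to scan the text parts of a message list before sending it. The sketch below is a hypothetical pre-flight check, not a Transformers feature. The token list combines the image and audio tokens named in the notes in this tutorial.

```python
# Tokens that the notes in this tutorial say must not appear in prompt text.
PLACEHOLDER_TOKENS = (
    "<|image|>",
    "<start_of_image>",
    "<image_soft_token>",
    "<|audio|>",
    "<audio_soft_token>",
)


def find_placeholder_tokens(messages):
    """Return the placeholder tokens found in any text part, without repeats."""
    found = []
    for message in messages:
        content = message.get("content", [])
        if isinstance(content, str):
            # Treat a bare string as a single text part.
            content = [{"type": "text", "text": content}]
        for part in content:
            if part.get("type") != "text":
                continue  # only text parts are checked
            for token in PLACEHOLDER_TOKENS:
                if token in part["text"] and token not in found:
                    found.append(token)
    return found


messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "<|audio|> Transcribe this clip."}],
    }
]
print(find_placeholder_tokens(messages))
```

An empty list means no placeholder tokens were found in the text parts. Image and audio entries themselves are not inspected.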

Next steps

Build and explore more with Gemma models: