圖像解讀

Gemini 模型從一開始就建構於多模態的基礎上,因此可執行各種圖像處理和電腦視覺工作,包括但不限於生成圖像說明、分類和回答圖像問題,無須訓練專門的機器學習模型。

除了提供一般多模態功能,Gemini 模型還透過額外訓練,針對特定用途 (例如物件偵測區隔) 提升準確度

將圖片傳送給 Gemini

你可以透過下列幾種方式,將圖片做為 Gemini 的輸入內容:

使用網址傳送圖片

您可以使用 Files API 上傳圖片,並在要求中傳遞圖片:

Python

from google import genai

client = genai.Client()

uploaded_file = client.files.upload(file="path/to/organ.jpg")

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "uri": uploaded_file.uri,
            "mime_type": uploaded_file.mime_type
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

const uploadedFile = await client.files.upload({
    file: "path/to/organ.jpg",
    config: { mime_type: "image/jpeg" }
});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "Caption this image."},
        {
            type: "image",
            uri: uploadedFile.uri,
            mime_type: uploadedFile.mimeType
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

# First upload the file using the Files API, then use the URI:
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "uri": "YOUR_FILE_URI",
        "mime_type": "image/jpeg"
      }
    ]
  }'

傳遞內嵌圖片資料

您可以提供採用 Base64 編碼的字串做為圖片資料:

Python

import base64
from google import genai

with open('path/to/small-sample.jpg', 'rb') as f:
    image_bytes = f.read()

client = genai.Client()

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "data": base64.b64encode(image_bytes).decode('utf-8'),
            "mime_type": "image/jpeg"
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const client = new GoogleGenAI({});
const base64ImageFile = fs.readFileSync("path/to/small-sample.jpg", {
  encoding: "base64",
});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "Caption this image."},
        {
            type: "image",
            data: base64ImageFile,
            mime_type: "image/jpeg"
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

IMG_PATH="/path/to/your/image1.jpg"

if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
  B64FLAGS="--input"
else
  B64FLAGS="-w0"
fi

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "data": "'"$(base64 $B64FLAGS $IMG_PATH)"'",
        "mime_type": "image/jpeg"
      }
    ]
  }'

使用 File API 上傳圖片

如要處理大型檔案或重複使用同一張圖片,請使用 Files API。請參閱 Files API 指南

Python

from google import genai

client = genai.Client()

my_file = client.files.upload(file="path/to/sample.jpg")

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "uri": my_file.uri,
            "mime_type": my_file.mime_type
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

const myfile = await client.files.upload({
    file: "path/to/sample.jpg",
    config: { mimeType: "image/jpeg" },
});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "Caption this image."},
        {
            type: "image",
            uri: myfile.uri,
            mime_type: myfile.mimeType
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

# First upload the file (see Files API guide for details)
# Then use the file URI in the request:

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "uri": "YOUR_FILE_URI",
        "mime_type": "image/jpeg"
      }
    ]
  }'

使用多張圖片做為提示

您可以在單一提示中提供多張圖片,方法是在 input 陣列中加入多個圖片物件:

Python

from google import genai

client = genai.Client()

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "What is different between these two images?"},
        {
            "type": "image",
            "uri": "https://example.com/image1.jpg",
            "mime_type": "image/jpeg"
        },
        {
            "type": "image",
            "uri": "https://example.com/image2.jpg",
            "mime_type": "image/jpeg"
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "What is different between these two images?"},
        {
            type: "image",
            uri: "https://example.com/image1.jpg",
            mime_type: "image/jpeg"
        },
        {
            type: "image",
            uri: "https://example.com/image2.jpg",
            mime_type: "image/jpeg"
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "What is different between these two images?"},
      {
        "type": "image",
        "uri": "https://example.com/image1.jpg",
        "mime_type": "image/jpeg"
      },
      {
        "type": "image",
        "uri": "https://example.com/image2.jpg",
        "mime_type": "image/jpeg"
      }
    ]
  }'

物件偵測

模型經過訓練後,可偵測圖片中的物件並取得定界框座標。座標會根據圖片尺寸縮放至 [0, 1000]。您需要根據原始圖片大小,縮放這些座標。

Python

from google import genai
from pydantic import BaseModel, Field
from typing import List
import json

client = genai.Client()
prompt = "Detect the all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."

class BoundingBox(BaseModel):
    box_2d: List[int] = Field(description="The 2D bounding box of the item as [ymin, xmin, ymax, xmax] normalized to 0-1000.")
    mask: List[List[int]] = Field(description="The segmentation mask of the item as a polygon of [x,y] coordinates, normalized to 0-1000.")
    label: str = Field(description="A descriptive label for the item.")

class BoundingBoxes(BaseModel):
    boxes: List[BoundingBox]

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": prompt},
        {
            "type": "image",
            "uri": "https://example.com/image.png",
            "mime_type": "image/png"
        }
    ],
    response_format={
        "type": "text",
        "mime_type": "application/json",
        "schema": BoundingBoxes.model_json_schema()
    }
)

bounding_boxes = BoundingBoxes.model_validate_json(interaction.steps[-1].content[0].text)
print(bounding_boxes)

JavaScript

import { GoogleGenAI } from "@google/genai";
import * as z from "zod";

const client = new GoogleGenAI({});
const prompt = "Detect the all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000.";

const boundingBoxesJsonSchema = {
  type: "object",
  properties: {
    boxes: {
      type: "array",
      items: {
        type: "object",
        properties: {
          box_2d: { type: "array", items: { type: "integer" }, description: "The 2D bounding box of the item as [ymin, xmin, ymax, xmax] normalized to 0-1000." },
          mask: { type: "array", items: { type: "array", items: { type: "integer" } }, description: "The segmentation mask of the item as a polygon of [x,y] coordinates, normalized to 0-1000." },
          label: { type: "string", description: "A descriptive label for the item." }
        },
        required: ["box_2d", "mask", "label"]
      }
    }
  },
  required: ["boxes"]
};

const boundingBoxesSchema = z.fromJSONSchema(boundingBoxesJsonSchema);

const interaction = await client.interactions.create({
  model: "gemini-3-flash-preview",
  input: [
    { type: "text", text: prompt },
    {
      type: "image",
      uri: "https://example.com/image.png",
      mime_type: "image/png"
    }
  ],
  response_format: {
    type: 'text',
    mime_type: 'application/json',
    schema: boundingBoxesJsonSchema
  },
});

const result = boundingBoxesSchema.parse(JSON.parse(interaction.steps.at(-1).content[0].text));
console.log(result);

REST

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Detect the all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."},
      {
        "type": "image",
        "uri": "https://example.com/image.png",
        "mime_type": "image/png"
      }
    ],
    "response_format": {
      "type": "text",
      "mime_type": "application/json",
      "schema": {
        "type": "object",
        "properties": {
          "boxes": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "box_2d": { "type": "array", "items": { "type": "integer" } },
                "mask": { "type": "array", "items": { "type": "array", "items": { "type": "integer" } } },
                "label": { "type": "string" }
              },
              "required": ["box_2d", "mask", "label"]
            }
          }
        },
        "required": ["boxes"]
      }
    }
  }'

如需更多範例,請參閱 Gemini 教戰手冊中的下列筆記本:

區隔

從 Gemini 2.5 開始,模型不僅能偵測項目,還能區隔項目並提供輪廓遮罩。

模型會預測 JSON 清單,其中每個項目都代表區隔遮罩。每個項目都有定界框 (「box_2d」),格式為 [y0, x0, y1, x1],其中包含介於 0 到 1000 之間的標準化座標、可識別物件的標籤 (「label」),以及定界框內的區隔遮罩 (以 Base64 編碼的 PNG 格式,是值介於 0 到 255 之間的機率地圖)。

Python

from google import genai
from pydantic import BaseModel, Field
from typing import List
import json

client = genai.Client()

prompt = """
Give the segmentation masks for the wooden and glass items.
Output a JSON list of segmentation masks where each entry contains the 2D
bounding box in the key "box_2d", the segmentation mask in key "mask", and
the text label in the key "label". Use descriptive labels.
"""

class BoundingBox(BaseModel):
    box_2d: List[int] = Field(description="The 2D bounding box of the item as [ymin, xmin, ymax, xmax] normalized to 0-1000.")
    mask: List[List[int]] = Field(description="The segmentation mask of the item as a polygon of [x,y] coordinates, normalized to 0-1000.")
    label: str = Field(description="A descriptive label for the item.")

class BoundingBoxes(BaseModel):
    boxes: List[BoundingBox]

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": prompt},
        {
            "type": "image",
            "uri": "https://example.com/image.png",
            "mime_type": "image/png"
        }
    ],
    response_format={
        "type": "text",
        "mime_type": "application/json",
        "schema": BoundingBoxes.model_json_schema()
    },
    generation_config={
        "thinking_level": "minimal"  # Minimize thinking for better detection results
    }
)

items = BoundingBoxes.model_validate_json(interaction.steps[-1].content[0].text)
print("Segmentation results:", items)

JavaScript

import { GoogleGenAI } from "@google/genai";
import * as z from "zod";

const client = new GoogleGenAI({});
const prompt = `
Give the segmentation masks for the wooden and glass items.
Output a JSON list of segmentation masks where each entry contains the 2D
bounding box in the key "box_2d", the segmentation mask in key "mask", and
the text label in the key "label". Use descriptive labels.
`;

const boundingBoxesJsonSchema = {
  type: "object",
  properties: {
    boxes: {
      type: "array",
      items: {
        type: "object",
        properties: {
          box_2d: { type: "array", items: { type: "integer" }, description: "The 2D bounding box of the item as [ymin, xmin, ymax, xmax] normalized to 0-1000." },
          mask: { type: "array", items: { type: "array", items: { type: "integer" } }, description: "The segmentation mask of the item as a polygon of [x,y] coordinates, normalized to 0-1000." },
          label: { type: "string", description: "A descriptive label for the item." }
        },
        required: ["box_2d", "mask", "label"]
      }
    }
  },
  required: ["boxes"]
};

const boundingBoxesSchema = z.fromJSONSchema(boundingBoxesJsonSchema);

const interaction = await client.interactions.create({
  model: "gemini-3-flash-preview",
  input: [
    { type: "text", text: prompt },
    {
      type: "image",
      uri: "https://example.com/image.png",
      mime_type: "image/png"
    }
  ],
  response_format: {
    type: 'text',
    mime_type: 'application/json',
    schema: boundingBoxesJsonSchema
  },
  generationConfig: {
    thinking_level: "minimal"
  }
});

const result = boundingBoxesSchema.parse(JSON.parse(interaction.steps.at(-1).content[0].text));
console.log(result);

REST

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Give the segmentation masks for the wooden and glass items.\nOutput a JSON list of segmentation masks where each entry contains the 2D\nbounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and\nthe text label in the key \"label\". Use descriptive labels."},
      {
        "type": "image",
        "uri": "https://example.com/image.png",
        "mime_type": "image/png"
      }
    ],
    "response_format": {
      "type": "text",
      "mime_type": "application/json",
      "schema": {
        "type": "object",
        "properties": {
          "boxes": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "box_2d": { "type": "array", "items": { "type": "integer" } },
                "mask": { "type": "array", "items": { "type": "array", "items": { "type": "integer" } } },
                "label": { "type": "string" }
              },
              "required": ["box_2d", "mask", "label"]
            }
          }
        },
        "required": ["boxes"]
      }
    },
    "config": {
      "thinking_level": "minimal"
    }
  }'
桌上擺著杯子蛋糕,木製和玻璃物品以亮色標示
含有物件和區隔遮罩的區隔輸出範例

支援的圖片格式

Gemini 支援下列圖片格式 MIME 類型:

  • PNG - image/png
  • JPEG - image/jpeg
  • WebP - image/webp
  • HEIC - image/heic
  • HEIF - image/heif

如要瞭解其他檔案輸入方式,請參閱「檔案輸入方式」指南。

功能

所有 Gemini 模型版本都是多模態模型,可用於各種圖像處理和電腦視覺工作,包括但不限於圖像說明、視覺問答、圖像分類、物件偵測和分割。

視品質和效能需求而定,Gemini 可減少使用專業機器學習模型的需求。

最新模型版本經過特別訓練,除了強化物件偵測區隔等一般功能外,還能提升特定工作的準確度。

限制和重要技術資訊

檔案限制

Gemini 模型每項要求最多可支援 3,600 個圖片檔案。

計算權杖

  • 如果兩個維度都 <= 384 像素,則為 258 個權杖。 較大的圖片會分割成 768x768 像素的圖塊,每個圖塊需支付 258 個權杖。

計算圖塊數量的粗略公式如下:

  • 計算裁剪單元大小 (約為 floor(min(width, height) / 1.5)。
  • 將每個維度除以裁剪單元大小,然後相乘,即可取得圖塊數量。

舉例來說,如果圖片尺寸為 960x540,裁剪單位大小為 360。將每個維度除以 360,圖塊數量為 3 * 2 = 6。

媒體解析度

Gemini 3 推出 media_resolution 參數,可精細控管多模態視覺處理作業。media_resolution 參數會決定每個輸入圖片或影片影格分配到的詞元數量上限。 解析度越高,模型就越能辨識細小文字或細節,但也會增加權杖用量和延遲時間。

提示與最佳做法

  • 確認圖片已正確旋轉。
  • 使用清晰的圖片,避免模糊不清。
  • 使用含有文字的單一圖片時,請將文字提示詞放在 input 陣列中的圖片前面

後續步驟

本指南說明如何上傳圖片檔案,以及如何從圖片輸入內容生成文字輸出內容。如要進一步瞭解相關內容,請參閱下列資源:

  • Files API:進一步瞭解如何上傳及管理檔案,以供 Gemini 使用。
  • 系統指令: 系統指令可根據特定需求和用途,引導模型行為。
  • 檔案提示策略:Gemini API 支援使用文字、圖片、音訊和影片資料提示,也就是多模態提示。
  • 安全指南:生成式 AI 模型有時會產生出乎意料的輸出內容,例如不準確、有偏見或令人反感的內容。後續處理和人工評估是不可或缺的環節,有助於降低這類輸出內容造成危害的風險。