Image understanding

Gemini models are built to be multimodal from the ground up, so they can perform a wide range of image processing and computer vision tasks, including but not limited to image captioning, classification, and visual question answering, without requiring you to train a specialized machine learning model.

Beyond these general multimodal capabilities, Gemini models are further trained to improve accuracy for specific tasks such as object detection and segmentation.

Sending images to Gemini

You can provide images as input to Gemini in the following ways:

Sending images using a URL

You can upload an image with the Files API and pass its URI in the request:

Python

from google import genai

client = genai.Client()

uploaded_file = client.files.upload(file="path/to/organ.jpg")

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "uri": uploaded_file.uri,
            "mime_type": uploaded_file.mime_type
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

const uploadedFile = await client.files.upload({
    file: "path/to/organ.jpg",
    config: { mimeType: "image/jpeg" }
});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "Caption this image."},
        {
            type: "image",
            uri: uploadedFile.uri,
            mimeType: uploadedFile.mimeType
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

# First upload the file using the Files API, then use the URI:
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "uri": "YOUR_FILE_URI",
        "mime_type": "image/jpeg"
      }
    ]
  }'

Passing inline image data

You can provide image data as a Base64-encoded string:

Python

import base64

from google import genai

with open('path/to/small-sample.jpg', 'rb') as f:
    image_bytes = f.read()

client = genai.Client()

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "data": base64.b64encode(image_bytes).decode('utf-8'),
            "mime_type": "image/jpeg"
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const client = new GoogleGenAI({});
const base64ImageFile = fs.readFileSync("path/to/small-sample.jpg", {
  encoding: "base64",
});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "Caption this image."},
        {
            type: "image",
            data: base64ImageFile,
            mimeType: "image/jpeg"
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

IMG_PATH="/path/to/your/image1.jpg"

if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
  B64FLAGS="--input"
else
  B64FLAGS="-w0"
fi

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "data": "'"$(base64 $B64FLAGS "$IMG_PATH")"'",
        "mime_type": "image/jpeg"
      }
    ]
  }'

Uploading images using the Files API

To work with large files, or to reuse the same image across requests, use the Files API. See the Files API guide for details.

Python

from google import genai

client = genai.Client()

my_file = client.files.upload(file="path/to/sample.jpg")

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "uri": my_file.uri,
            "mime_type": my_file.mime_type
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

const myfile = await client.files.upload({
    file: "path/to/sample.jpg",
    config: { mimeType: "image/jpeg" },
});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "Caption this image."},
        {
            type: "image",
            uri: myfile.uri,
            mimeType: myfile.mimeType
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

# First upload the file (see Files API guide for details)
# Then use the file URI in the request:

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "uri": "YOUR_FILE_URI",
        "mime_type": "image/jpeg"
      }
    ]
  }'

Prompting with multiple images

You can include multiple images in a single prompt by adding multiple image objects to the input array:

Python

from google import genai

client = genai.Client()

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "What is different between these two images?"},
        {
            "type": "image",
            "uri": "https://example.com/image1.jpg",
            "mime_type": "image/jpeg"
        },
        {
            "type": "image",
            "uri": "https://example.com/image2.jpg",
            "mime_type": "image/jpeg"
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "What is different between these two images?"},
        {
            type: "image",
            uri: "https://example.com/image1.jpg",
            mimeType: "image/jpeg"
        },
        {
            type: "image",
            uri: "https://example.com/image2.jpg",
            mimeType: "image/jpeg"
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "What is different between these two images?"},
      {
        "type": "image",
        "uri": "https://example.com/image1.jpg",
        "mime_type": "image/jpeg"
      },
      {
        "type": "image",
        "uri": "https://example.com/image2.jpg",
        "mime_type": "image/jpeg"
      }
    ]
  }'

Object detection

The model is trained to detect objects in an image and return their bounding box coordinates. Coordinates are normalized to [0, 1000] relative to the image dimensions, so you need to rescale them to the original image size.

Python

from google import genai
import json

client = genai.Client()
prompt = "Detect all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": prompt},
        {
            "type": "image",
            "uri": "https://example.com/image.png",
            "mime_type": "image/png"
        }
    ],
    response_format={
        "type": "text",
        "mime_type": "application/json"
    }
)

bounding_boxes = json.loads(interaction.steps[-1].content[0].text)
print("Bounding boxes:", bounding_boxes)
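Since the returned box_2d values are normalized to [0, 1000], mapping them back to pixels is a simple rescale. A minimal sketch (the helper name and the example box are illustrative):

```python
def to_pixel_box(box_2d, img_width, img_height):
    """Convert a [ymin, xmin, ymax, xmax] box normalized to 0-1000
    into absolute pixel coordinates for the original image."""
    ymin, xmin, ymax, xmax = box_2d
    return (
        int(ymin / 1000 * img_height),
        int(xmin / 1000 * img_width),
        int(ymax / 1000 * img_height),
        int(xmax / 1000 * img_width),
    )

# Example: a centered box on a 1920x1080 image
print(to_pixel_box([250, 250, 750, 750], 1920, 1080))  # (270, 480, 810, 1440)
```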

For more examples, see the notebooks in the Gemini Cookbook.

Segmentation

Starting with Gemini 2.5, models can not only detect items but also segment them and provide contour masks.

The model predicts a JSON list in which each entry represents a segmentation mask. Each entry has a bounding box ("box_2d") in the format [y0, x0, y1, x1] with coordinates normalized to 0-1000, a label ("label") identifying the object, and the segmentation mask inside the bounding box, as a Base64-encoded PNG that is a probability map with values from 0 to 255.

Python

from google import genai
import json

client = genai.Client()

prompt = """
Give the segmentation masks for the wooden and glass items.
Output a JSON list of segmentation masks where each entry contains the 2D
bounding box in the key "box_2d", the segmentation mask in key "mask", and
the text label in the key "label". Use descriptive labels.
"""

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": prompt},
        {
            "type": "image",
            "uri": "https://example.com/image.png",
            "mime_type": "image/png"
        }
    ],
    config={
        "thinking_level": "minimal"  # Minimize thinking for better detection results
    }
)

items = json.loads(interaction.steps[-1].content[0].text)
print("Segmentation results:", items)
[Image: cupcakes on a table, with the wooden and glass items highlighted]
Example segmentation output with objects and segmentation masks
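To use a returned mask, decode its Base64 PNG back into a probability map and threshold it. A hedged sketch using Pillow (the data-URI handling and the 127 threshold are illustrative assumptions, not confirmed output details):

```python
import base64
import io

from PIL import Image

def decode_mask(entry, threshold=127):
    """Decode one segmentation entry: Base64 PNG -> binary mask.

    `entry` is one item of the model's JSON list, with keys
    "box_2d", "label", and "mask" (raw Base64 or a data URI).
    """
    data = entry["mask"]
    # Strip a data-URI prefix such as "data:image/png;base64," if present.
    if data.startswith("data:"):
        data = data.split(",", 1)[1]
    png_bytes = base64.b64decode(data)
    prob_map = Image.open(io.BytesIO(png_bytes)).convert("L")  # values 0-255
    # Threshold the probability map into a binary (0/255) mask.
    return prob_map.point(lambda v: 255 if v > threshold else 0)
```

The binary mask is sized to the bounding box, so paste it at the box's pixel position (after rescaling box_2d) to overlay it on the original image.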

Supported image formats

Gemini supports the following image format MIME types:

  • PNG - image/png
  • JPEG - image/jpeg
  • WebP - image/webp
  • HEIC - image/heic
  • HEIF - image/heif

For other ways to pass in files, see the file input guide.

Capabilities

All Gemini model versions are multimodal and can be used for a wide range of image processing and computer vision tasks, including but not limited to image captioning, visual question answering, image classification, object detection, and segmentation.

Depending on your quality and performance requirements, Gemini can reduce the need for specialized machine learning models.

The latest model versions are additionally trained to improve accuracy on specialized tasks such as object detection and segmentation, on top of their general capabilities.

Limitations and key technical information

File limits

Gemini models support a maximum of 3,600 image files per request.

Token calculation

  • If both the long and short edges are 384 pixels or smaller, an image counts as 258 tokens.
  • Larger images are cropped into 768x768-pixel tiles, each costing 258 tokens.

A rough formula for the number of tiles:

  • Compute the crop unit size: approximately floor(min(width, height) / 1.5).
  • Divide each dimension by the crop unit size, round up, and multiply the results to get the number of tiles.

For example, a 960x540 image has a crop unit size of 360. Dividing each dimension by 360 and rounding up gives 3 * 2 = 6 tiles.
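This rule of thumb can be written as a small helper (an illustrative sketch of the approximation above, not an exact accounting of API billing):

```python
import math

TOKENS_PER_TILE = 258

def estimate_image_tokens(width, height):
    """Approximate token cost of an image, per the tiling rule above."""
    if max(width, height) <= 384:
        return TOKENS_PER_TILE  # small images cost a flat 258 tokens
    crop_unit = math.floor(min(width, height) / 1.5)
    # Tiles needed to cover each dimension, rounded up, then multiplied.
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE

print(estimate_image_tokens(960, 540))  # 6 tiles -> 1548 tokens
```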

Media resolution

Gemini 3 introduces the media_resolution parameter for fine-grained control over multimodal vision processing. media_resolution sets the maximum number of tokens allocated to each input image or video frame. Higher resolutions improve the model's ability to read small text and fine details, but increase token usage and latency.
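Requesting higher resolution for an image might look like the following, mirroring the request shape used throughout this guide. Note that the field placement and the "high" value here are illustrative assumptions, not confirmed API details:

```python
# Hypothetical request body: media_resolution attached to the image part.
# The field placement and the "high" value are illustrative assumptions.
request_input = [
    {"type": "text", "text": "Transcribe the small print in this image."},
    {
        "type": "image",
        "uri": "YOUR_FILE_URI",
        "mime_type": "image/jpeg",
        # Higher resolution: better fine-detail reading, but more tokens
        # and higher latency.
        "media_resolution": "high",
    },
]
```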

Tips and best practices

  • Verify that images are rotated correctly.
  • Use clear images that aren't blurry.
  • When using a single image with text, place the text prompt before the image in the input array.

Next steps

This guide showed how to upload image files and generate text output from image input. To learn more, see the following resources:

  • Files API: Learn more about uploading and managing files for use with Gemini.
  • System instructions: System instructions let you steer the model's behavior based on your specific needs and use case.
  • File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
  • Safety guidance: Generative AI models sometimes produce unexpected output, such as inaccurate, biased, or offensive content. Post-processing and human evaluation are essential to limit the risk of harm from such output.