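Image understanding
Gemini models are built to be multimodal from the ground up, so they can handle a wide range of image processing and computer vision tasks, including but not limited to image captioning, classification, and visual question answering, without training specialized machine learning models.
In addition to their general multimodal capabilities, Gemini models are further trained to improve accuracy for specific use cases such as object detection and segmentation.
Passing images to Gemini
You can provide images as input to Gemini in the following ways:
- Pass an image by URL: suitable for publicly accessible images.
- Pass inline image data: suitable for base64-encoded image data.
- Upload an image with the Files API: recommended for larger files or for reusing an image across multiple requests.
Pass an image by URL
You can upload an image with the Files API and pass the resulting URI in the request: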
Python
from google import genai

client = genai.Client()

# Upload the image through the Files API, then reference it by URI
uploaded_file = client.files.upload(file="path/to/organ.jpg")

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "uri": uploaded_file.uri,
            "mime_type": uploaded_file.mime_type
        }
    ]
)
print(interaction.steps[-1].content[0].text)
JavaScript
import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

// Upload the image through the Files API, then reference it by URI
const uploadedFile = await client.files.upload({
  file: "path/to/organ.jpg",
  config: { mimeType: "image/jpeg" }
});

const interaction = await client.interactions.create({
  model: "gemini-3-flash-preview",
  input: [
    {type: "text", text: "Caption this image."},
    {
      type: "image",
      uri: uploadedFile.uri,
      // Same snake_case field name as the other JavaScript samples in this guide
      mime_type: uploadedFile.mimeType
    }
  ]
});
console.log(interaction.steps.at(-1).content[0].text);
REST
# First upload the file using the Files API, then use the URI:
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "uri": "YOUR_FILE_URI",
        "mime_type": "image/jpeg"
      }
    ]
  }'
Pass inline image data
You can provide the image data as a Base64-encoded string:
Python
import base64

from google import genai

# Read the raw image bytes from disk
with open('path/to/small-sample.jpg', 'rb') as f:
    image_bytes = f.read()

client = genai.Client()

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            # The request carries the image as a Base64 string
            "data": base64.b64encode(image_bytes).decode('utf-8'),
            "mime_type": "image/jpeg"
        }
    ]
)
print(interaction.steps[-1].content[0].text)
JavaScript
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const client = new GoogleGenAI({});

// Read the file directly as a Base64 string
const base64ImageFile = fs.readFileSync("path/to/small-sample.jpg", {
  encoding: "base64",
});

const interaction = await client.interactions.create({
  model: "gemini-3-flash-preview",
  input: [
    {type: "text", text: "Caption this image."},
    {
      type: "image",
      data: base64ImageFile,
      mime_type: "image/jpeg"
    }
  ]
});
console.log(interaction.steps.at(-1).content[0].text);
REST
IMG_PATH="/path/to/your/image1.jpg"

# FreeBSD/macOS base64 takes the file via --input; GNU base64 needs -w0 to disable line wrapping
if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
  B64FLAGS="--input"
else
  B64FLAGS="-w0"
fi

# IMG_PATH is quoted so that paths containing spaces still work
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "data": "'"$(base64 $B64FLAGS "$IMG_PATH")"'",
        "mime_type": "image/jpeg"
      }
    ]
  }'
Upload an image with the Files API
To handle large files or to reuse the same image, use the Files API. See the Files API guide.
Python
from google import genai

client = genai.Client()

# Upload once; the returned URI can be reused across requests
my_file = client.files.upload(file="path/to/sample.jpg")

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "uri": my_file.uri,
            "mime_type": my_file.mime_type
        }
    ]
)
print(interaction.steps[-1].content[0].text)
JavaScript
import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

// Upload once; the returned URI can be reused across requests
const myfile = await client.files.upload({
  file: "path/to/sample.jpg",
  config: { mimeType: "image/jpeg" },
});

const interaction = await client.interactions.create({
  model: "gemini-3-flash-preview",
  input: [
    {type: "text", text: "Caption this image."},
    {
      type: "image",
      uri: myfile.uri,
      mime_type: myfile.mimeType
    }
  ]
});
console.log(interaction.steps.at(-1).content[0].text);
REST
# First upload the file (see Files API guide for details)
# Then use the file URI in the request:
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "uri": "YOUR_FILE_URI",
        "mime_type": "image/jpeg"
      }
    ]
  }'
Prompting with multiple images
You can provide multiple images in a single prompt by adding multiple image objects to the input array:
Python
from google import genai

client = genai.Client()

# Each image is its own object in the input array
interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "What is different between these two images?"},
        {
            "type": "image",
            "uri": "https://example.com/image1.jpg",
            "mime_type": "image/jpeg"
        },
        {
            "type": "image",
            "uri": "https://example.com/image2.jpg",
            "mime_type": "image/jpeg"
        }
    ]
)
print(interaction.steps[-1].content[0].text)
JavaScript
import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

// Each image is its own object in the input array
const interaction = await client.interactions.create({
  model: "gemini-3-flash-preview",
  input: [
    {type: "text", text: "What is different between these two images?"},
    {
      type: "image",
      uri: "https://example.com/image1.jpg",
      mime_type: "image/jpeg"
    },
    {
      type: "image",
      uri: "https://example.com/image2.jpg",
      mime_type: "image/jpeg"
    }
  ]
});
console.log(interaction.steps.at(-1).content[0].text);
REST
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "What is different between these two images?"},
      {
        "type": "image",
        "uri": "https://example.com/image1.jpg",
        "mime_type": "image/jpeg"
      },
      {
        "type": "image",
        "uri": "https://example.com/image2.jpg",
        "mime_type": "image/jpeg"
      }
    ]
  }'
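Object detection
Models are trained to detect objects in an image and return their bounding box coordinates. The coordinates are relative to the image dimensions and scaled to the range [0, 1000]. You need to rescale them based on the original image size.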
Python
import json

from google import genai

client = genai.Client()

prompt = "Detect all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": prompt},
        {
            "type": "image",
            "uri": "https://example.com/image.png",
            "mime_type": "image/png"
        }
    ],
    # Ask for JSON so the response can be parsed directly
    response_format={
        "type": "text",
        "mime_type": "application/json"
    }
)

bounding_boxes = json.loads(interaction.steps[-1].content[0].text)
print("Bounding boxes:", bounding_boxes)
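Because box_2d values are normalized to the 0-1000 range, they have to be mapped back to pixels before you draw or crop. The following is a minimal sketch of that rescaling step. The helper name to_pixel_box and the (left, top, right, bottom) return order are illustrative choices and are not part of the API.

```python
def to_pixel_box(box_2d, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] in 0-1000 units to pixel coordinates.

    Hypothetical helper: returns (left, top, right, bottom) in pixels.
    """
    ymin, xmin, ymax, xmax = box_2d
    # x values scale with the image width, y values with the image height.
    left = round(xmin * image_width / 1000)
    top = round(ymin * image_height / 1000)
    right = round(xmax * image_width / 1000)
    bottom = round(ymax * image_height / 1000)
    return left, top, right, bottom


# Example: one normalized box on a 1920x1080 image.
print(to_pixel_box([100, 200, 500, 800], 1920, 1080))  # (384, 108, 1536, 540)
```

The returned tuple uses x-first ordering because many drawing and cropping routines expect it; swap the order if your tooling expects y first.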
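For more examples, see the notebooks in the Gemini Cookbook.
Segmentation
Starting with Gemini 2.5, models can not only detect items but also segment them and provide their contour masks.
The model predicts a JSON list in which each item represents a segmentation mask. Each item has a bounding box ("box_2d") in the format [y0, x0, y1, x1] with normalized coordinates between 0 and 1000, a label ("label") that identifies the object, and the segmentation mask inside the bounding box, returned as a Base64-encoded PNG that is a probability map with values between 0 and 255.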
Python
import json

from google import genai

client = genai.Client()

prompt = """
Give the segmentation masks for the wooden and glass items.
Output a JSON list of segmentation masks where each entry contains the 2D
bounding box in the key "box_2d", the segmentation mask in key "mask", and
the text label in the key "label". Use descriptive labels.
"""

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": prompt},
        {
            "type": "image",
            "uri": "https://example.com/image.png",
            "mime_type": "image/png"
        }
    ],
    config={
        "thinking_level": "minimal"  # Minimize thinking for better detection results
    }
)

items = json.loads(interaction.steps[-1].content[0].text)
print("Segmentation results:", items)
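Each entry in the returned list carries its mask as Base64-encoded PNG data, so a first processing step is to turn that field back into PNG bytes. The sketch below does only that, using the standard library. The function names are hypothetical, and the handling of a "data:image/png;base64," prefix is an assumption about how the mask string may be formatted. Decoding the PNG into pixels and thresholding the 0-255 probability map requires an image library and is left out.

```python
import base64
import json

# Assumption: the mask may arrive as a data URI with this prefix.
# Plain Base64 without the prefix is accepted as well.
DATA_URI_PREFIX = "data:image/png;base64,"
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def decode_mask_png(mask_field):
    """Return the raw PNG bytes stored in a "mask" field."""
    if mask_field.startswith(DATA_URI_PREFIX):
        mask_field = mask_field[len(DATA_URI_PREFIX):]
    png_bytes = base64.b64decode(mask_field)
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("mask does not contain PNG data")
    return png_bytes


def parse_segmentation_items(response_text):
    """Parse the JSON list and decode the mask of every entry."""
    results = []
    for item in json.loads(response_text):
        results.append({
            "label": item["label"],
            "box_2d": item["box_2d"],  # [y0, x0, y1, x1], normalized to 0-1000
            "mask_png": decode_mask_png(item["mask"]),
        })
    return results


# Usage with a stand-in response: the "mask" here is only a PNG signature,
# not a real image, so the example runs without calling the API.
fake_mask = DATA_URI_PREFIX + base64.b64encode(PNG_SIGNATURE).decode("ascii")
sample = json.dumps(
    [{"box_2d": [10, 20, 300, 400], "label": "wooden bowl", "mask": fake_mask}]
)
for entry in parse_segmentation_items(sample):
    print(entry["label"], entry["box_2d"], len(entry["mask_png"]), "bytes")
```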
Supported image formats
Gemini supports the following image format MIME types:
- PNG - image/png
- JPEG - image/jpeg
- WebP - image/webp
- HEIC - image/heic
- HEIF - image/heif
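If you build requests from local files, it can help to check the format before sending anything. The following is a small sketch that maps a file extension to one of the MIME types listed above. The helper name and the extension table (including treating both .jpg and .jpeg as JPEG) are assumptions for illustration; the check looks only at the file name, not at the file contents.

```python
from pathlib import Path

# MIME types from the list above; the extension keys are an assumption.
SUPPORTED_IMAGE_MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def image_mime_type(path):
    """Return the MIME type for a supported image path, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_IMAGE_MIME_TYPES:
        raise ValueError(f"unsupported image format: {suffix or '(no extension)'}")
    return SUPPORTED_IMAGE_MIME_TYPES[suffix]


print(image_mime_type("path/to/sample.JPG"))  # image/jpeg
```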
To learn about other ways to provide files, see the File input methods guide.
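Capabilities
All Gemini model versions are multimodal and can be used for a wide range of image processing and computer vision tasks, including but not limited to image captioning, visual question answering, image classification, object detection, and segmentation.
Depending on your quality and performance requirements, Gemini can reduce the need for specialized machine learning models.
The latest model versions are also specifically trained to improve accuracy on specialized tasks, such as object detection and segmentation, in addition to their general capabilities.
Limitations and key technical information
File limit
Gemini models support a maximum of 3,600 image files per request.
Token calculation
- An image costs 258 tokens if both of its dimensions are less than or equal to 384 pixels. Larger images are split into 768x768 pixel tiles, and each tile costs 258 tokens.
A rough formula for calculating the number of tiles is as follows:
- Calculate the crop unit size, which is roughly floor(min(width, height) / 1.5).
- Divide each dimension by the crop unit size, round each result up, and multiply the two results to get the number of tiles.
For example, for an image with dimensions 960x540, the crop unit size is 360. Dividing each dimension by 360 and rounding up gives 3 * 2 = 6 tiles.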
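The rough formula above can be sketched in a few lines of Python. This is an estimate under stated assumptions, not the service's exact accounting: the function names are hypothetical, each quotient is rounded up (which is what the 960x540 example implies), and a small image is counted as a single 258-token unit.

```python
import math

TOKENS_PER_TILE = 258
SMALL_IMAGE_MAX_SIDE = 384


def estimate_tiles(width, height):
    """Roughly estimate how many 258-token units an image is billed as."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Small images are not tiled; count them as one unit.
    if width <= SMALL_IMAGE_MAX_SIDE and height <= SMALL_IMAGE_MAX_SIDE:
        return 1
    # Crop unit size: roughly floor(min(width, height) / 1.5), kept at least 1.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    # Assumption: each quotient is rounded up, as in the 960x540 example.
    return math.ceil(width / crop_unit) * math.ceil(height / crop_unit)


def estimate_image_tokens(width, height):
    """Rough token estimate for one image."""
    return estimate_tiles(width, height) * TOKENS_PER_TILE


print(estimate_tiles(960, 540))         # 6
print(estimate_image_tokens(960, 540))  # 1548
print(estimate_image_tokens(300, 200))  # 258
```

For exact numbers, rely on the token counts the API reports rather than on this approximation.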
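Media resolution
Gemini 3 introduces the media_resolution parameter for fine-grained control over multimodal vision processing. The media_resolution parameter determines the maximum number of tokens allocated to each input image or video frame.
Higher resolutions improve the model's ability to read small text and pick out fine details, but they also increase token usage and latency.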
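Tips and best practices
- Verify that images are correctly rotated.
- Use clear, non-blurry images.
- When using a single image with text, place the text prompt before the image in the input array.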
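One way to apply the ordering tip consistently is to build the input array with a small helper. The sketch below is only an illustration: build_input is a hypothetical function, not part of the SDK. It produces the same list of dictionaries used in the examples in this guide, places the text prompt first, and rejects requests that exceed the 3,600-image limit mentioned earlier.

```python
MAX_IMAGES_PER_REQUEST = 3600  # per-request limit stated in this guide


def build_input(prompt, images):
    """Build an input array with the text prompt first, then the images.

    images is a list of (uri, mime_type) pairs.
    """
    if len(images) > MAX_IMAGES_PER_REQUEST:
        raise ValueError(
            f"too many images: {len(images)} > {MAX_IMAGES_PER_REQUEST}"
        )
    parts = [{"type": "text", "text": prompt}]
    for uri, mime_type in images:
        parts.append({"type": "image", "uri": uri, "mime_type": mime_type})
    return parts


request_input = build_input(
    "Caption this image.",
    [("https://example.com/image1.jpg", "image/jpeg")],
)
print(request_input)
```

The resulting list can be passed as the input argument of client.interactions.create, as in the examples above.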
What's next
This guide shows how to upload image files and generate text output from image inputs. To learn more, see the following resources: