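Image understanding

Gemini models can process images, enabling many frontier developer use cases that historically required domain-specific models. Gemini's vision capabilities include the ability to:

Caption images and answer questions about them
Transcribe and reason over PDFs, including up to 2 million tokens
Detect objects in an image and return bounding box coordinates for them
Segment objects within an image

Gemini was built to be multimodal from the ground up, and we continue to push the frontier of what is possible. This guide shows how to use the Gemini API to generate text responses based on image inputs and perform common image understanding tasks.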

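Before you begin

Before calling the Gemini API, make sure you have your SDK of choice installed and a Gemini API key configured and ready to use.

Image input

You can provide images as input to Gemini in the following ways:

Upload an image file using the Files API before making a request to generateContent. Use this method for files larger than 20 MB or when you want to reuse the file across multiple requests.
Pass inline image data with the request to generateContent. Use this method for smaller files (total request size under 20 MB) or images fetched directly from URLs.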
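
The choice between the two methods can be sketched as a small helper. This is a minimal illustration, not part of any SDK: the function names, the reading of 20 MB as 20 * 1024 * 1024 bytes, and the decision to count the image at its Base64-encoded size are all assumptions made for the example.

```python
# Sketch only: the 20 MB threshold comes from this guide. The helper names,
# the byte value used for "20 MB", and the Base64 accounting are assumptions.
MAX_INLINE_REQUEST_BYTES = 20 * 1024 * 1024


def base64_size(num_bytes: int) -> int:
    """Size of num_bytes after Base64 encoding (4 output chars per 3 input bytes)."""
    return 4 * ((num_bytes + 2) // 3)


def choose_input_method(image_bytes: int, other_bytes: int = 0, reuse: bool = False) -> str:
    """Return "files_api" or "inline" for an image of the given size.

    image_bytes: raw size of the image file.
    other_bytes: size of everything else in the request (prompt, system instructions).
    reuse: True if the same image will be sent in several requests.
    """
    if reuse:
        return "files_api"
    total = base64_size(image_bytes) + other_bytes
    return "inline" if total < MAX_INLINE_REQUEST_BYTES else "files_api"


print(choose_input_method(1_000_000))               # small image, single use
print(choose_input_method(16 * 1024 * 1024))        # Base64 pushes it over the limit
print(choose_input_method(1_000_000, reuse=True))   # reused across requests
```

Counting the encoded size is the conservative choice here, because Base64 inflates the payload by roughly a third.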

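Upload an image file

You can use the Files API to upload an image file. Always use the Files API when the total request size (including the file, text prompt, system instructions, and so on) is larger than 20 MB, or if you intend to use the same image in multiple prompts.

The following code uploads an image file and then uses the file in a call to generateContent.

Python, JavaScript, Go, and REST examples follow, in that order: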

from google import genai

client = genai.Client(api_key="GOOGLE_API_KEY")

myfile = client.files.upload(file="path/to/sample.jpg")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[myfile, "Caption this image."])

print(response.text)

import {
  GoogleGenAI,
  createUserContent,
  createPartFromUri,
} from "@google/genai";

const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });

async function main() {
  const myfile = await ai.files.upload({
    file: "path/to/sample.jpg",
    config: { mimeType: "image/jpeg" },
  });

  const response = await ai.models.generateContent({
    model: "gemini-2.0-flash",
    contents: createUserContent([
      createPartFromUri(myfile.uri, myfile.mimeType),
      "Caption this image.",
    ]),
  });
  console.log(response.text);
}

await main();

file, err := client.UploadFileFromPath(ctx, "path/to/sample.jpg", nil)
if err != nil {
    log.Fatal(err)
}
defer client.DeleteFile(ctx, file.Name)

model := client.GenerativeModel("gemini-2.0-flash")
resp, err := model.GenerateContent(ctx,
    genai.FileData{URI: file.URI},
    genai.Text("Caption this image."))
if err != nil {
    log.Fatal(err)
}

printResponse(resp)

IMAGE_PATH="path/to/sample.jpg"
MIME_TYPE=$(file -b --mime-type "${IMAGE_PATH}")
NUM_BYTES=$(wc -c < "${IMAGE_PATH}")
DISPLAY_NAME=IMAGE

tmp_header_file=upload-header.tmp

# Initial resumable request defining metadata.
# The upload URL is in the response headers; dump them to a file.
curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
  -D "${tmp_header_file}" \
  -H "X-Goog-Upload-Protocol: resumable" \
  -H "X-Goog-Upload-Command: start" \
  -H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
  -H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
  -H "Content-Type: application/json" \
  -d "{'file': {'display_name': '${DISPLAY_NAME}'}}" 2> /dev/null

upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"

# Upload the actual bytes.
curl "${upload_url}" \
  -H "Content-Length: ${NUM_BYTES}" \
  -H "X-Goog-Upload-Offset: 0" \
  -H "X-Goog-Upload-Command: upload, finalize" \
  --data-binary "@${IMAGE_PATH}" 2> /dev/null > file_info.json

file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri

# Now generate content using that file
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{
      "contents": [{
        "parts":[
          {"file_data":{"mime_type": "'"${MIME_TYPE}"'", "file_uri": '$file_uri'}},
          {"text": "Caption this image."}]
        }]
      }' 2> /dev/null > response.json

cat response.json
echo

jq ".candidates[].content.parts[].text" response.json

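To learn more about working with media files, see the Files API.

Pass image data inline

Instead of uploading an image file, you can pass inline image data in the request to generateContent. This is suitable for smaller images (total request size under 20 MB) or images fetched directly from URLs.

You can provide image data as a Base64-encoded string or by reading local files directly, depending on the SDK.

Local image file:

Python, JavaScript, Go, and REST examples follow, in that order: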

from google import genai
from google.genai import types

client = genai.Client(api_key="GOOGLE_API_KEY")

# Read the local image as raw bytes.
with open('path/to/small-sample.jpg', 'rb') as f:
    img_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
        types.Part.from_bytes(
            data=img_bytes,
            mime_type='image/jpeg',
        ),
        'Caption this image.'
    ]
)

print(response.text)

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
const base64ImageFile = fs.readFileSync("path/to/small-sample.jpg", {
  encoding: "base64",
});

const contents = [
  {
    inlineData: {
      mimeType: "image/jpeg",
      data: base64ImageFile,
    },
  },
  { text: "Caption this image." },
];

const response = await ai.models.generateContent({
  model: "gemini-2.0-flash",
  contents: contents,
});
console.log(response.text);

model := client.GenerativeModel("gemini-2.0-flash")

bytes, err := os.ReadFile("path/to/small-sample.jpg")
if err != nil {
  log.Fatal(err)
}

prompt := []genai.Part{
  genai.Blob{MIMEType: "image/jpeg", Data: bytes},
  genai.Text("Caption this image."),
}

resp, err := model.GenerateContent(ctx, prompt...)
if err != nil {
  log.Fatal(err)
}

for _, c := range resp.Candidates {
  if c.Content != nil {
    fmt.Println(*c.Content)
  }
}

IMG_PATH=/path/to/your/image1.jpg

if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
  B64FLAGS="--input"
else
  B64FLAGS="-w0"
fi

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{
      "contents": [{
        "parts":[
            {
              "inline_data": {
                "mime_type":"image/jpeg",
                "data": "'"$(base64 $B64FLAGS "$IMG_PATH")"'"
              }
            },
            {"text": "Caption this image."}
        ]
      }]
    }' 2> /dev/null

Image from a URL:

Python, JavaScript, Go, and REST examples follow, in that order:

from google import genai
from google.genai import types

import requests

image_path = "https://goo.gle/instrument-img"
image = requests.get(image_path)

client = genai.Client(api_key="GOOGLE_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=["What is this image?",
              types.Part.from_bytes(data=image.content, mime_type="image/jpeg")])

print(response.text)

import { GoogleGenAI } from "@google/genai";

async function main() {
  const ai = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });

  const imageUrl = "https://goo.gle/instrument-img";

  const response = await fetch(imageUrl);
  const imageArrayBuffer = await response.arrayBuffer();
  const base64ImageData = Buffer.from(imageArrayBuffer).toString('base64');

  const result = await ai.models.generateContent({
    model: "gemini-2.0-flash",
    contents: [
    {
      inlineData: {
        mimeType: 'image/jpeg',
        data: base64ImageData,
      },
    },
    { text: "Caption this image." }
  ],
  });
  console.log(result.text);
}

main();

func main() {
  ctx := context.Background()
  client, err := genai.NewClient(ctx, option.WithAPIKey(os.Getenv("GOOGLE_API_KEY")))
  if err != nil {
    log.Fatal(err)
  }
  defer client.Close()

  model := client.GenerativeModel("gemini-2.0-flash")

  // Download the image.
  imageResp, err := http.Get("https://goo.gle/instrument-img")
  if err != nil {
    panic(err)
  }
  defer imageResp.Body.Close()

  imageBytes, err := io.ReadAll(imageResp.Body)
  if err != nil {
    panic(err)
  }

  // Create the request.
  req := []genai.Part{
    genai.ImageData("jpeg", imageBytes),
    genai.Text("Caption this image."),
  }

  // Generate content.
  resp, err := model.GenerateContent(ctx, req...)
  if err != nil {
    panic(err)
  }

  // Handle the response of generated text.
  for _, c := range resp.Candidates {
    if c.Content != nil {
      fmt.Println(*c.Content)
    }
  }
}

IMG_URL="https://goo.gle/instrument-img"

MIME_TYPE=$(curl -sIL "$IMG_URL" | grep -i '^content-type:' | awk -F ': ' '{print $2}' | sed 's/\r$//' | head -n 1)
if [[ -z "$MIME_TYPE" || ! "$MIME_TYPE" == image/* ]]; then
  MIME_TYPE="image/jpeg"
fi

if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
  B64FLAGS="--input"
else
  B64FLAGS="-w0"
fi

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{
      "contents": [{
        "parts":[
            {
              "inline_data": {
                "mime_type":"'"$MIME_TYPE"'",
                "data": "'$(curl -sL "$IMG_URL" | base64 $B64FLAGS)'"
              }
            },
            {"text": "Caption this image."}
        ]
      }]
    }' 2> /dev/null

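A few things to keep in mind about inline image data:

The maximum total request size is 20 MB, which includes text prompts, system instructions, and all files provided inline. If your file's size would make the total request size exceed 20 MB, use the Files API to upload the image file for use in the request.
If you're using an image sample multiple times, it's more efficient to upload the image file using the Files API.

Prompting with multiple images

You can provide multiple images in a single prompt by including multiple image Part objects in the contents array. These can be a mix of inline data (local files or URLs) and Files API references.

Python, JavaScript, Go, and REST examples follow, in that order: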

from google import genai
from google.genai import types

client = genai.Client(api_key="GOOGLE_API_KEY")

# Upload the first image
image1_path = "path/to/image1.jpg"
uploaded_file = client.files.upload(file=image1_path)

# Prepare the second image as inline data
image2_path = "path/to/image2.png"
with open(image2_path, 'rb') as f:
    img2_bytes = f.read()

# Create the prompt with text and multiple images
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        "What is different between these two images?",
        uploaded_file,  # Use the uploaded file reference
        types.Part.from_bytes(
            data=img2_bytes,
            mime_type='image/png'
        )
    ]
)

print(response.text)

import {
  GoogleGenAI,
  createUserContent,
  createPartFromUri,
} from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });

async function main() {
  // Upload the first image
  const image1_path = "path/to/image1.jpg";
  const uploadedFile = await ai.files.upload({
    file: image1_path,
    config: { mimeType: "image/jpeg" },
  });

  // Prepare the second image as inline data
  const image2_path = "path/to/image2.png";
  const base64Image2File = fs.readFileSync(image2_path, {
    encoding: "base64",
  });

  // Create the prompt with text and multiple images
  const response = await ai.models.generateContent({
    model: "gemini-2.0-flash",
    contents: createUserContent([
      "What is different between these two images?",
      createPartFromUri(uploadedFile.uri, uploadedFile.mimeType),
      {
        inlineData: {
          mimeType: "image/png",
          data: base64Image2File,
        },
      },
    ]),
  });
  console.log(response.text);
}

await main();

// Upload the first image
image1Path := "path/to/image1.jpg"
uploadedFile, err := client.UploadFileFromPath(ctx, image1Path, nil)
if err != nil {
    log.Fatal(err)
}
defer client.DeleteFile(ctx, uploadedFile.Name)

// Prepare the second image as inline data
image2Path := "path/to/image2.png"
img2Bytes, err := os.ReadFile(image2Path)
if err != nil {
  log.Fatal(err)
}

// Create the prompt with text and multiple images
model := client.GenerativeModel("gemini-2.0-flash")
prompt := []genai.Part{
  genai.Text("What is different between these two images?"),
  genai.FileData{URI: uploadedFile.URI},
  genai.Blob{MIMEType: "image/png", Data: img2Bytes},
}

resp, err := model.GenerateContent(ctx, prompt...)
if err != nil {
  log.Fatal(err)
}

printResponse(resp)

# Upload the first image
IMAGE1_PATH="path/to/image1.jpg"
MIME1_TYPE=$(file -b --mime-type "${IMAGE1_PATH}")
NUM1_BYTES=$(wc -c < "${IMAGE1_PATH}")
DISPLAY_NAME1=IMAGE1

tmp_header_file1=upload-header1.tmp

curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
  -D upload-header1.tmp \
  -H "X-Goog-Upload-Protocol: resumable" \
  -H "X-Goog-Upload-Command: start" \
  -H "X-Goog-Upload-Header-Content-Length: ${NUM1_BYTES}" \
  -H "X-Goog-Upload-Header-Content-Type: ${MIME1_TYPE}" \
  -H "Content-Type: application/json" \
  -d "{'file': {'display_name': '${DISPLAY_NAME1}'}}" 2> /dev/null

upload_url1=$(grep -i "x-goog-upload-url: " "${tmp_header_file1}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file1}"

curl "${upload_url1}" \
  -H "Content-Length: ${NUM1_BYTES}" \
  -H "X-Goog-Upload-Offset: 0" \
  -H "X-Goog-Upload-Command: upload, finalize" \
  --data-binary "@${IMAGE1_PATH}" 2> /dev/null > file_info1.json

file1_uri=$(jq ".file.uri" file_info1.json)
echo file1_uri=$file1_uri

# Prepare the second image (inline)
IMAGE2_PATH="path/to/image2.png"
MIME2_TYPE=$(file -b --mime-type "${IMAGE2_PATH}")

if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
  B64FLAGS="--input"
else
  B64FLAGS="-w0"
fi
IMAGE2_BASE64=$(base64 $B64FLAGS $IMAGE2_PATH)

# Now generate content using both images
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{
      "contents": [{
        "parts":[
          {"text": "What is different between these two images?"},
          {"file_data":{"mime_type": "'"${MIME1_TYPE}"'", "file_uri": '$file1_uri'}},
          {
            "inline_data": {
              "mime_type":"'"${MIME2_TYPE}"'",
              "data": "'"$IMAGE2_BASE64"'"
            }
          }
        ]
      }]
    }' 2> /dev/null > response.json

cat response.json
echo

jq ".candidates[].content.parts[].text" response.json

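Get a bounding box for an object

Gemini models are trained to identify objects in an image and provide their bounding box coordinates. The coordinates are returned relative to the image dimensions, scaled to [0, 1000]. You need to descale these coordinates based on your original image size.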

The prompt in Python, JavaScript, Go, and REST, in that order:

prompt = "Detect all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."

const prompt = "Detect all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000.";

prompt := []genai.Part{
    genai.FileData{URI: sampleImage.URI},
    genai.Text("Detect all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."),
}

PROMPT="Detect all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."

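You can use bounding boxes for object detection and localization within images and video. By accurately identifying and delineating objects with bounding boxes, you can unlock a wide range of applications and enhance the intelligence of your projects.

Key benefits

Simple: Integrate object detection capabilities into your applications with ease, regardless of your computer vision expertise.
Customizable: Produce bounding boxes based on custom instructions (for example, "I want to see bounding boxes of all the green objects in this image"), without having to train a custom model.

Technical details

Input: Your prompt and associated images or video frames.
Output: Bounding boxes in the [y_min, x_min, y_max, x_max] format. The top left corner is the origin. The x and y axes run horizontally and vertically, respectively. Coordinate values are normalized to 0-1000 for every image.
Visualization: AI Studio users will see bounding boxes plotted within the UI.

Python developers can try the 2D spatial understanding notebook or the experimental 3D pointing notebook.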

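Normalize coordinates

The model returns bounding box coordinates in the format [y_min, x_min, y_max, x_max]. To convert these normalized coordinates to the pixel coordinates of your original image, follow these steps:

Divide each output coordinate by 1000.
Multiply the x-coordinates by the original image width.
Multiply the y-coordinates by the original image height.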
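
The three steps above can be sketched in a few lines of Python. The function name, the rounding to whole pixels, and the (left, top, right, bottom) return order are choices made for this example rather than anything the API prescribes.

```python
def to_pixel_box(box_2d, image_width, image_height):
    """Convert [y_min, x_min, y_max, x_max] on a 0-1000 scale to pixel coordinates.

    Returns (left, top, right, bottom); this return order is an assumption
    chosen because many drawing libraries expect it.
    """
    y_min, x_min, y_max, x_max = box_2d
    # Divide by 1000, then scale x by the width and y by the height.
    left = round(x_min / 1000 * image_width)
    top = round(y_min / 1000 * image_height)
    right = round(x_max / 1000 * image_width)
    bottom = round(y_max / 1000 * image_height)
    return left, top, right, bottom


# A box covering the middle of a 1920x1080 image.
print(to_pixel_box([250, 100, 750, 900], image_width=1920, image_height=1080))
# → (192, 270, 1728, 810)
```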

To learn more about generating bounding box coordinates and visualizing them on images, see the object detection cookbook example.

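Image segmentation

Starting with the Gemini 2.5 models, Gemini models are trained not only to detect items but also to segment them and provide a mask of their contours.

The model predicts a JSON list, where each item represents a segmentation mask. Each item has a bounding box ("box_2d") in the format [y0, x0, y1, x1] with normalized coordinates between 0 and 1000, a label ("label") that identifies the object, and finally the segmentation mask inside the bounding box, as a base64-encoded PNG that is a probability map with values between 0 and 255. The mask needs to be resized to match the bounding box dimensions, then binarized at your confidence threshold (127 for the midpoint).

The prompt in Python, JavaScript, Go, and REST, in that order: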

prompt = """
  Give the segmentation masks for the wooden and glass items.
  Output a JSON list of segmentation masks where each entry contains the 2D
  bounding box in the key "box_2d", the segmentation mask in key "mask", and
  the text label in the key "label". Use descriptive labels.
"""

const prompt = `
  Give the segmentation masks for the wooden and glass items.
  Output a JSON list of segmentation masks where each entry contains the 2D
  bounding box in the key "box_2d", the segmentation mask in key "mask", and
  the text label in the key "label". Use descriptive labels.
`;

prompt := []genai.Part{
    genai.FileData{URI: sampleImage.URI},
    genai.Text(`
      Give the segmentation masks for the wooden and glass items.
      Output a JSON list of segmentation masks where each entry contains the 2D
      bounding box in the key "box_2d", the segmentation mask in key "mask", and
      the text label in the key "label". Use descriptive labels.
    `),
}

PROMPT='''
  Give the segmentation masks for the wooden and glass items.
  Output a JSON list of segmentation masks where each entry contains the 2D
  bounding box in the key "box_2d", the segmentation mask in key "mask", and
  the text label in the key "label". Use descriptive labels.
'''

For a more detailed example, see the segmentation example in the cookbook guide.
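
The mask post-processing described in this section (resize to the bounding box, then threshold) can be sketched in pure Python. The sketch assumes the base64 PNG has already been decoded, with an imaging library of your choice, into a 2D list of 0-255 values. The function names, the nearest-neighbour resize, and treating values strictly above the threshold as foreground are illustrative assumptions.

```python
def resize_nearest(mask, new_height, new_width):
    """Nearest-neighbour resize of a 2D list of 0-255 values."""
    old_height, old_width = len(mask), len(mask[0])
    return [
        [mask[r * old_height // new_height][c * old_width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]


def place_mask(mask, box_2d, image_width, image_height, threshold=127):
    """Return a full-image binary mask (rows of 0/1) for one segmentation entry.

    mask: decoded probability map (2D list, values 0-255); decoding the
          base64 PNG is assumed to have happened elsewhere.
    box_2d: [y0, x0, y1, x1] normalized to 0-1000.
    """
    y0, x0, y1, x1 = box_2d
    top, left = y0 * image_height // 1000, x0 * image_width // 1000
    bottom, right = y1 * image_height // 1000, x1 * image_width // 1000
    full = [[0] * image_width for _ in range(image_height)]
    if bottom <= top or right <= left:
        return full  # degenerate box: nothing to paint
    resized = resize_nearest(mask, bottom - top, right - left)
    for r, row in enumerate(resized):
        for c, value in enumerate(row):
            # Binarize: values above the threshold count as foreground.
            full[top + r][left + c] = 1 if value > threshold else 0
    return full


# A 2x2 probability map placed in the top-left quarter of a 4x4 image.
for row in place_mask([[0, 255], [255, 0]], [0, 0, 500, 500], 4, 4):
    print(row)
```

For real images you would likely replace the list-based resize with your imaging library's resize call. The pure-Python version is used here only to keep the sketch dependency-free.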

Supported image formats

Gemini supports the following image format MIME types:

PNG - image/png
JPEG - image/jpeg
WEBP - image/webp
HEIC - image/heic
HEIF - image/heif
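
When building inline requests, it can help to resolve the MIME type up front and reject unsupported files early. The following is a minimal sketch: the helper name and the file-extension mapping (including treating both .jpg and .jpeg as image/jpeg) are assumptions for illustration, and only the MIME types themselves come from the list above.

```python
import os

# MIME types from the list above; the extension keys are an assumed mapping.
SUPPORTED_IMAGE_MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def image_mime_type(path):
    """Return the MIME type for a supported image path, or raise ValueError."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return SUPPORTED_IMAGE_MIME_TYPES[ext]
    except KeyError:
        raise ValueError(f"Unsupported image extension: {ext!r}") from None


print(image_mime_type("path/to/sample.JPG"))   # → image/jpeg
print(image_mime_type("path/to/image2.png"))   # → image/png
```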

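Technical details about images

File limit: Gemini 2.5 Pro, 2.0 Flash, 1.5 Pro, and 1.5 Flash support a maximum of 3,600 image files per request.
Token calculation:
- Gemini 1.5 Flash and Gemini 1.5 Pro: 258 tokens if both dimensions are smaller than 384 pixels. Larger images are tiled (minimum tile 256 pixels, maximum 768 pixels, resized to 768x768 pixels), with each tile costing 258 tokens.
- Gemini 2.0 Flash: 258 tokens if both dimensions are smaller than 384 pixels. Larger images are tiled into 768x768 pixel tiles, with each tile costing 258 tokens.
Best practices:
- Make sure your images are correctly rotated.
- Use clear, non-blurry images.
- When using a single image with text, place the text prompt after the image part in the contents array.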
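
As a rough illustration of the Gemini 2.0 Flash token calculation above, the sketch below estimates the token cost of an image from its dimensions. The guide does not spell out how tiles are laid out, so the simple grid that covers the image with 768x768 tiles is an assumption, as is treating a side of exactly 384 pixels as small. Treat the result as an estimate only.

```python
import math

TOKENS_PER_TILE = 258        # from the guide
SMALL_IMAGE_MAX_SIDE = 384   # from the guide; boundary handling is assumed
TILE_SIDE = 768              # from the guide


def estimate_image_tokens(width, height):
    """Estimate tokens for one image under an assumed grid-tiling rule."""
    if width <= SMALL_IMAGE_MAX_SIDE and height <= SMALL_IMAGE_MAX_SIDE:
        return TOKENS_PER_TILE
    # Assumption: the image is covered by a grid of 768x768 tiles.
    tiles = math.ceil(width / TILE_SIDE) * math.ceil(height / TILE_SIDE)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(300, 200))    # → 258
print(estimate_image_tokens(1536, 768))   # → 516
```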

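What's next

This guide shows how to upload image files and generate text outputs from image inputs. To learn more, see the following resources:

System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
Video understanding: Learn how to work with video inputs.
Files API: Learn more about uploading and managing files for use with Gemini.
File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
Safety guidance: Generative AI models sometimes produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.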