Gemini 2.0 Flash 現已準備好投入實際使用！瞭解詳情

本頁面由 Cloud Translation API 翻譯而成。

運用 Gemini API 探索視覺功能

在 ai.google.dev 上查看

試用 Colab 筆記本

在 GitHub 中查看筆記本

Gemini 模型可處理圖片和影片，因此開發人員可利用這項功能實現許多前所未見的用途，而這類用途過去需要使用特定領域的模型。Gemini 的視覺功能包括：

為圖片加上說明文字，並回答圖片相關問題
轉錄及推論 PDF，包括最多 200 萬個符記
描述、分割及擷取長達 90 分鐘的影片資訊
偵測圖片中的物件，並傳回物件的定界框座標

Gemini 從一開始就是以多模態為設計宗旨，我們會持續突破 AI 技術的極限。

圖片輸入

如果圖片酬載總大小小於 20 MB，建議您上傳 Base64 編碼的圖片，或直接上傳儲存在本機的圖片檔案。

使用本機映像檔

如果您使用的是 Python 影像處理程式庫 (Pillow)，也可以使用 PIL 圖片物件。

from google import genai
from google.genai import types

import PIL.Image

image = PIL.Image.open('/path/to/image.png')

client = genai.Client(api_key="GEMINI_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=["What is this image?", image])

print(response.text)

Base64 編碼圖片

您可以將公開圖片網址編碼為 Base64 酬載，然後上傳。以下程式碼範例說明如何只使用標準程式庫工具執行此操作：

from google import genai
from google.genai import types

import requests

image_path = "https://goo.gle/instrument-img"
image = requests.get(image_path)

client = genai.Client(api_key="GEMINI_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=["What is this image?",
              types.Part.from_bytes(data=image.content, mime_type="image/jpeg")])

print(response.text)

多張圖片

如要提示多張圖片，您可以在呼叫 generate_content 時提供多張圖片。這些圖片可以是任何支援的格式，包括 base64 或 PIL。

from google import genai
from google.genai import types

import pathlib
import PIL.Image

image_path_1 = "path/to/your/image1.jpeg"  # Replace with the actual path to your first image
image_path_2 = "path/to/your/image2.jpeg" # Replace with the actual path to your second image

image_url_1 = "https://goo.gle/instrument-img" # Replace with the actual URL to your third image

pil_image = PIL.Image.open(image_path_1)

b64_image = types.Part.from_bytes(
    data=pathlib.Path(image_path_2).read_bytes(),
    mime_type="image/jpeg"
)

downloaded_image = requests.get(image_url_1)

client = genai.Client(api_key="GEMINI_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=["What do these images have in common?",
              pil_image, b64_image, downloaded_image])

print(response.text)

請注意，這些內嵌資料呼叫不包含透過 File API 提供的許多功能，例如取得檔案中繼資料、列出或刪除檔案。

大型圖片酬載

如果您要傳送的檔案和系統指示總大小超過 20 MB，請使用 File API 上傳這些檔案。

使用 File API 的 media.upload 方法，上傳任何大小的圖片。

上傳檔案後，您可以提出參照 File API URI 的 GenerateContent 要求。選取生成式模型，並提供文字提示和上傳的圖片。

from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")

img_path = "/path/to/Cajun_instruments.jpg"
file_ref = client.files.upload(file=img_path)
print(f'{file_ref=}')

client = genai.Client(api_key="GEMINI_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=["What can you tell me about these instruments?",
              file_ref])

print(response.text)

OpenAI 相容性

您可以使用 OpenAI 程式庫存取 Gemini 的圖像理解功能。這樣一來，您就能更新三行程式碼並使用 Gemini API 金鑰，將 Gemini 整合至現有的 OpenAI 工作流程。請參閱圖像理解範例，瞭解如何傳送以 Base64 酬載編碼的圖片。

使用圖片提示

在本教學課程中，您將使用 File API 上傳圖片，或將圖片做為內嵌資料，並根據這些圖片產生內容。

技術詳細資料 (圖片)

Gemini 2.0 Flash、1.5 Pro 和 1.5 Flash 最多可支援 3,600 個圖片檔案。

圖片必須採用下列圖片資料 MIME 類型：

PNG - image/png
JPEG - image/jpeg
WEBP - image/webp
HEIC - image/heic
HEIF - image/heif

詞元

以下是圖片的符記計算方式：

Gemini 1.0 Pro Vision：每張圖片占 258 個符號。
Gemini 1.5 Flash 和 Gemini 1.5 Pro：如果圖片的兩個尺寸都小於或等於 384 像素，則會使用 258 個符記。如果圖片的一個尺寸大於 384 像素，系統會將圖片裁剪成圖塊。每個圖塊大小的預設值為最小尺寸 (寬度或高度) 除以 1.5。必要時，系統會調整每個圖塊，使其大小不小於 256 像素，也不大於 768 像素。接著，每個資訊方塊的大小都會調整為 768x768，並使用 258 個符記。
Gemini 2.0 Flash：如果圖片輸入的兩個尺寸都小於或等於 384 像素，系統會將其計為 258 個符記。圖片的其中一個或兩個尺寸過大時，系統會視需要裁剪及縮放圖片，使其成為 768 x 768 像素的圖塊，每個圖塊會計為 258 個符記。

為獲得最佳成效

請在上傳前將圖片旋轉至正確方向。
請避免使用模糊的圖片。
如果使用單張圖片，請將文字提示放在圖片後方。

功能

本節將概略說明 Gemini 模型的特定視覺功能，包括物件偵測和定界框座標。

取得物件的邊界框

經過訓練後，Gemini 模型會將定界框座標以 [0, 1] 範圍內的相對寬度或高度傳回。然後，這些值會以 1000 的比例縮放，並轉換為整數。實際上，座標代表圖片 1000x1000 像素版本上的邊界框。因此，您需要將這些座標轉換回原始圖片的尺寸，才能準確對應定界框。

from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")

prompt = (
  "Return a bounding box for each of the objects in this image "
  "in [ymin, xmin, ymax, xmax] format.")

response = client.models.generate_content(
  model="gemini-1.5-pro",
  contents=[sample_file_1, prompt])

print(response.text)

您可以使用定界框偵測圖片和影片中的物件，並進行本地化。透過準確辨識物件並使用定界框來劃分物件，您就能發揮多種應用程式，並提升專案的智慧程度。

主要優點

簡單：無論您是否具備電腦視覺專業知識，都能輕鬆將物件偵測功能整合至應用程式。
可自訂：根據自訂指示 (例如「我想查看這張圖片中所有綠色物件的邊界框」) 產生邊界框，無須訓練自訂模型。

技術詳細資料

輸入內容：提示內容和相關圖片或影片影格。
輸出：定界框，格式為 [y_min, x_min, y_max, x_max]。左上角是原點。x 和 y 軸分別為水平和垂直。每張圖片的座標值都會正規化為 0 到 1000。
視覺化：AI Studio 使用者會在 UI 中看到邊界框。

Python 開發人員可以試試 2D 空間理解筆記本或實驗性 3D 指標筆記本。

將座標標準化

模型會以 [y_min, x_min, y_max, x_max] 格式傳回定界框座標。如要將這些標準化座標轉換為原始圖片的像素座標，請按照下列步驟操作：

將每個輸出座標除以 1000。
將 x 座標乘以原始圖片寬度。
將 y 座標乘以原始圖片高度。

如要進一步瞭解如何產生邊界框座標並在圖片上顯示，建議您參閱物件檢測食譜範例。

使用影片提示

在本教學課程中，您將使用 File API 上傳影片，並根據這些圖片產生內容。

技術詳細資料 (影片)

Gemini 1.5 Pro 和 Flash 最多可支援約一小時的影片資料。

影片必須採用下列任一影片格式 MIME 類型：

video/mp4
video/mpeg
video/mov
video/avi
video/x-flv
video/mpg
video/webm
video/wmv
video/3gpp

File API 服務會以每秒 1 格 (FPS) 的速度從影片中擷取圖像影格，並以 1 Kbps 的速度擷取單一頻道的音訊，每秒加入時間戳記。這些費率日後可能會因推論功能的改善而有所變動。

個別影格為 258 個符記，音訊則為每秒 32 個符記。使用中繼資料後，每秒影片會變成約 300 個符記，也就是說，100 萬個脈絡窗口最多可容納略低於一小時的影片。

如要詢問有關時間戳記位置的問題，請使用 MM:SS 格式，其中前兩位數字代表分鐘，後兩位數字代表秒數。

為確保最佳成效：

每個提示使用一支影片。
如果使用單一影片，請將文字提示放在影片後方。

使用 File API 上傳影片檔案

File API 可直接接受影片檔案格式。本範例使用 NASA 的短片「Jupiter's Great Red Spot Shrinks and Grows」。圖片來源：Goddard Space Flight Center (GSFC)/David Ladd (2018)。

「Jupiter's Great Red Spot Shrinks and Grows」屬於公有領域，且沒有可識別的人物。(NASA 圖片和媒體使用規範)

請先擷取短片：

wget https://storage.googleapis.com/generativeai-downloads/images/GreatRedSpot.mp4

使用 File API 上傳影片，並列印 URI。

from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")

print("Uploading file...")
video_file = client.files.upload(file="GreatRedSpot.mp4")
print(f"Completed upload: {video_file.uri}")

驗證檔案上傳作業並檢查狀態

呼叫 files.get 方法，確認 API 已成功接收檔案。

import time

# Check whether the file is ready to be used.
while video_file.state.name == "PROCESSING":
    print('.', end='')
    time.sleep(1)
    video_file = client.files.get(name=video_file.name)

if video_file.state.name == "FAILED":
  raise ValueError(video_file.state.name)

print('Done')

使用影片和文字提示

上傳的影片處於 ACTIVE 狀態後，您可以發出 GenerateContent 要求，指定該影片的 File API URI。選取生成模型，並提供上傳的影片和文字提示。

from IPython.display import Markdown

# Pass the video file reference like any other media part.
response = client.models.generate_content(
    model="gemini-1.5-pro",
    contents=[
        video_file,
        "Summarize this video. Then create a quiz with answer key "
        "based on the information in the video."])

# Print the response, rendering any Markdown
Markdown(response.text)

參考內容中的時間戳記

您可以使用 HH:MM:SS 格式的時間戳記，參照影片中的特定片段。

prompt = "What are the examples given at 01:05 and 01:19 supposed to show us?"

response = client.models.generate_content(
    model="gemini-1.5-pro",
    contents=[video_file, prompt])

print(response.text)

轉錄影片並提供視覺描述

Gemini 模型可同時處理音軌和影像影格，並為影片內容轉錄並提供視覺說明。針對視覺描述，模型會以每秒 1 格的速度取樣影片。這項取樣率可能會影響說明的詳細程度，特別是針對視覺效果快速變化的影片。

prompt = (
    "Transcribe the audio from this video, giving timestamps for "
    "salient events in the video. Also provide visual descriptions.")

response = client.models.generate_content(
    model="gemini-1.5-pro",
    contents=[video_file, prompt])

print(response.text)

可列出檔案

您可以使用 files.list 列出所有使用 File API 上傳的檔案，以及這些檔案的 URI。

from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")

print('My files:')
for f in client.files.list():
  print(" ", f'{f.name}: {f.uri}')

刪除檔案

使用 File API 上傳的檔案會在 2 天後自動刪除。您也可以使用 files.delete 手動刪除這些項目。

from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")

# Upload a file
poem_file = client.files.upload(file="poem.txt")

# Files will auto-delete after a period.
print(poem_file.expiration_time)

# Or they can be deleted explicitly.
dr = client.files.delete(name=poem_file.name)

try:
  client.models.generate_content(
      model="gemini-2.0-flash-exp",
      contents=['Finish this poem:', poem_file])
except genai.errors.ClientError as e:
  print(e.code)  # 403
  print(e.status)  # PERMISSION_DENIED
  print(e.message)  # You do not have permission to access the File .. or it may not exist.

後續步驟

本指南說明如何使用 File API 上傳圖片和影片檔案，然後根據圖片和影片輸入內容產生文字輸出內容。如要進一步瞭解相關內容，請參閱下列資源：

檔案提示策略：Gemini API 支援使用文字、圖片、音訊和影片資料提示，這也稱為多模態提示。
系統指示：系統指示可讓您根據特定需求和用途，引導模型的行為。
安全指南：生成式 AI 模型有時會產生意外的輸出內容，例如不準確、有偏見或令人反感的輸出內容。後續處理和人工評估是限制這類輸出內容造成危害風險的必要措施。