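Image understanding

Note: This version of the page describes the new Interactions API, which is currently in beta.
For stable production deployments, we recommend that you continue to use the generateContent API. You can use the toggle on this page to switch between versions.

Gemini models are built from the ground up as multimodal AI. As a result, they can perform a wide range of image processing and computer vision tasks, including image captioning, classification, and visual question answering, without requiring you to train specialized ML models.

In addition to their general multimodal capabilities, Gemini models offer improved accuracy for specific use cases such as object detection and segmentation through additional training.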

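Passing images to Gemini

You can provide images as input to Gemini in the following ways:

Pass an image using a URL: Best for publicly accessible images.
Pass inline image data: For base64-encoded image data.
Upload an image using the Files API: Recommended for larger files or for reusing an image across multiple requests.

Pass an image using a URL

You can upload an image using the Files API and pass its URI in the request.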

Python

from google import genai

client = genai.Client()

uploaded_file = client.files.upload(file="path/to/organ.jpg")

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "uri": uploaded_file.uri,
            "mime_type": uploaded_file.mime_type
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

const uploadedFile = await client.files.upload({
    file: "path/to/organ.jpg",
    config: { mimeType: "image/jpeg" }
});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "Caption this image."},
        {
            type: "image",
            uri: uploadedFile.uri,
            mime_type: uploadedFile.mimeType
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

# First upload the file using the Files API, then use the URI:
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "uri": "YOUR_FILE_URI",
        "mime_type": "image/jpeg"
      }
    ]
  }'

Pass inline image data

You can provide image data as a base64-encoded string.

Python

import base64

from google import genai

with open('path/to/small-sample.jpg', 'rb') as f:
    image_bytes = f.read()

client = genai.Client()

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "data": base64.b64encode(image_bytes).decode('utf-8'),
            "mime_type": "image/jpeg"
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const client = new GoogleGenAI({});
const base64ImageFile = fs.readFileSync("path/to/small-sample.jpg", {
  encoding: "base64",
});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "Caption this image."},
        {
            type: "image",
            data: base64ImageFile,
            mime_type: "image/jpeg"
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

IMG_PATH="/path/to/your/image1.jpg"

if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
  B64FLAGS="--input"
else
  B64FLAGS="-w0"
fi

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "data": "'"$(base64 $B64FLAGS $IMG_PATH)"'",
        "mime_type": "image/jpeg"
      }
    ]
  }'
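
The inline pattern above can be wrapped in a small helper. The following is a minimal sketch, not part of the SDK: `inline_image_part` is a hypothetical name, and the dictionary it returns only mirrors the `type`, `data`, and `mime_type` fields used in the examples above.

```python
import base64
from pathlib import Path


def inline_image_part(path: str, mime_type: str = "image/jpeg") -> dict:
    """Build an inline image entry for the `input` list (hypothetical helper)."""
    image_bytes = Path(path).read_bytes()
    return {
        "type": "image",
        # The examples above send base64 text rather than raw bytes.
        "data": base64.b64encode(image_bytes).decode("utf-8"),
        "mime_type": mime_type,
    }
```

The returned dictionary can be placed in the `input` list next to a text entry, in the same position as the hand-written image object in the Python example above.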

Upload an image using the Files API

For large files, or when you want to use the same image file repeatedly, use the Files API. See the Files API guide for details.

Python

from google import genai

client = genai.Client()

my_file = client.files.upload(file="path/to/sample.jpg")

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "Caption this image."},
        {
            "type": "image",
            "uri": my_file.uri,
            "mime_type": my_file.mime_type
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

const myfile = await client.files.upload({
    file: "path/to/sample.jpg",
    config: { mimeType: "image/jpeg" },
});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "Caption this image."},
        {
            type: "image",
            uri: myfile.uri,
            mime_type: myfile.mimeType
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

# First upload the file (see Files API guide for details)
# Then use the file URI in the request:

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "Caption this image."},
      {
        "type": "image",
        "uri": "YOUR_FILE_URI",
        "mime_type": "image/jpeg"
      }
    ]
  }'

Prompting with multiple images

You can provide multiple images in a single prompt by including multiple image objects in the input array.

Python

from google import genai

client = genai.Client()

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": "What is different between these two images?"},
        {
            "type": "image",
            "uri": "https://example.com/image1.jpg",
            "mime_type": "image/jpeg"
        },
        {
            "type": "image",
            "uri": "https://example.com/image2.jpg",
            "mime_type": "image/jpeg"
        }
    ]
)
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({});

const interaction = await client.interactions.create({
    model: "gemini-3-flash-preview",
    input: [
        {type: "text", text: "What is different between these two images?"},
        {
            type: "image",
            uri: "https://example.com/image1.jpg",
            mime_type: "image/jpeg"
        },
        {
            type: "image",
            uri: "https://example.com/image2.jpg",
            mime_type: "image/jpeg"
        }
    ]
});
console.log(interaction.steps.at(-1).content[0].text);

REST

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemini-3-flash-preview",
    "input": [
      {"type": "text", "text": "What is different between these two images?"},
      {
        "type": "image",
        "uri": "https://example.com/image1.jpg",
        "mime_type": "image/jpeg"
      },
      {
        "type": "image",
        "uri": "https://example.com/image2.jpg",
        "mime_type": "image/jpeg"
      }
    ]
  }'
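
When the number of images varies, the `input` array can be assembled programmatically. This is a minimal sketch under the assumption that every image shares one MIME type; `build_multi_image_input` is a hypothetical helper, not an SDK function.

```python
def build_multi_image_input(prompt: str, image_uris: list[str],
                            mime_type: str = "image/jpeg") -> list[dict]:
    """Return a text entry followed by one image entry per URI (hypothetical helper)."""
    parts = [{"type": "text", "text": prompt}]
    for uri in image_uris:
        parts.append({"type": "image", "uri": uri, "mime_type": mime_type})
    return parts


request_input = build_multi_image_input(
    "What is different between these two images?",
    ["https://example.com/image1.jpg", "https://example.com/image2.jpg"],
)
print(len(request_input))  # 3: one text entry plus two image entries
```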

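Object detection

The model is trained to detect objects in an image and return their bounding box coordinates. The coordinates are normalized to the range [0, 1000] relative to the image dimensions. To get pixel coordinates, rescale them using your original image size: divide each value by 1000, then multiply y values by the image height and x values by the image width.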

Python

import json

from google import genai

client = genai.Client()
prompt = "Detect all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": prompt},
        {
            "type": "image",
            "uri": "https://example.com/image.png",
            "mime_type": "image/png"
        }
    ],
    response_format={
        "type": "text",
        "mime_type": "application/json"
    }
)

bounding_boxes = json.loads(interaction.steps[-1].content[0].text)
print("Bounding boxes:", bounding_boxes)
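
The rescaling step can be sketched as follows. This assumes each returned item carries a `box_2d` list in the [ymin, xmin, ymax, xmax] order requested by the prompt; `to_pixel_box` and the sample data are hypothetical and only illustrate the arithmetic.

```python
def to_pixel_box(box_2d, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] in 0-1000 units to pixel (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * image_width / 1000)
    top = round(ymin * image_height / 1000)
    right = round(xmax * image_width / 1000)
    bottom = round(ymax * image_height / 1000)
    return left, top, right, bottom


# Hypothetical model output for a 1280x720 image.
sample = [{"box_2d": [100, 250, 500, 750], "label": "cup"}]
for item in sample:
    print(item["label"], to_pixel_box(item["box_2d"], 1280, 720))
# prints: cup (320, 72, 960, 360)
```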

For more examples, see the notebooks in the Gemini Cookbook.

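Segmentation

Starting with Gemini 2.5, models can not only detect items but also segment them and provide their contour masks.

The model predicts a JSON list, where each item represents a segmentation mask. Each item contains a bounding box ("box_2d") in the format [y0, x0, y1, x1] with coordinates normalized to the range 0 to 1000, a label ("label") that identifies the object, and the segmentation mask inside the bounding box, provided as a base64-encoded PNG that is a probability map with values from 0 to 255.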

Python

import json

from google import genai

client = genai.Client()

prompt = """
Give the segmentation masks for the wooden and glass items.
Output a JSON list of segmentation masks where each entry contains the 2D
bounding box in the key "box_2d", the segmentation mask in key "mask", and
the text label in the key "label". Use descriptive labels.
"""

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input=[
        {"type": "text", "text": prompt},
        {
            "type": "image",
            "uri": "https://example.com/image.png",
            "mime_type": "image/png"
        }
    ],
    config={
        "thinking_level": "minimal"  # Minimize thinking for better detection results
    }
)

items = json.loads(interaction.steps[-1].content[0].text)
print("Segmentation results:", items)
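
The example above passes the response text straight to `json.loads`. Because that request does not set a JSON response format, the text may arrive wrapped in a Markdown code fence; this is an assumption about model behavior, not a documented guarantee. A defensive parser, with the hypothetical name `parse_json_output`, might look like this:

```python
import json

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the snippet readable


def parse_json_output(text: str):
    """Parse JSON from model text, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (the fence marker plus an optional language tag).
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence and anything after it.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)
```

With this helper, `items = parse_json_output(interaction.steps[-1].content[0].text)` could replace the direct `json.loads` call.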

Figure: Example segmentation output with objects and segmentation masks. A table with cupcakes, with the wooden and glass objects highlighted.

Supported image formats

Gemini supports the following image format MIME types:

PNG - image/png
JPEG - image/jpeg
WEBP - image/webp
HEIC - image/heic
HEIF - image/heif
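
A request can be checked against this list before it is sent. The sketch below uses the MIME types listed above; the file-extension mapping and the `image_mime_type` helper are assumptions added for illustration.

```python
from pathlib import Path

# MIME types come from the list above; the extension keys are an assumption.
SUPPORTED_IMAGE_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def image_mime_type(path: str) -> str:
    """Return the MIME type for a supported image file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED_IMAGE_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported image extension: {suffix!r}") from None


print(image_mime_type("photos/Organ.JPG"))  # prints: image/jpeg
```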

For other file input methods, see the file input methods guide.

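Capabilities

All Gemini model versions are multimodal and can be used for a wide range of image processing and computer vision tasks, including image captioning, visual question answering, image classification, object detection, and segmentation.

Depending on your quality and performance requirements, Gemini can reduce the need to use specialized ML models.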

Limitations and key technical information

File limit

Gemini models support a maximum of 3,600 image files per request.

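Token calculation

An image costs 258 tokens if both of its dimensions are 384 pixels or smaller. Larger images are tiled into 768x768 pixel tiles, each costing 258 tokens.

A rough formula for calculating the number of tiles is as follows:

Calculate the crop unit size, which is roughly floor(min(width, height) / 1.5).
Divide each dimension by the crop unit size, round up, and multiply the results together to get the number of tiles.

For example, a 960x540 image has a crop unit size of floor(540 / 1.5) = 360. Dividing each dimension by 360 and rounding up gives 3 * 2 = 6 tiles, or 6 * 258 = 1,548 tokens.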
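
The rough formula above can be written as a short function. This is an estimate only: rounding each quotient up is an assumption inferred from the 960x540 example, and `estimate_image_tokens` is a hypothetical name rather than an API call.

```python
import math

TOKENS_PER_TILE = 258


def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate the token cost of one image using the rough tiling formula."""
    if width <= 384 and height <= 384:
        return TOKENS_PER_TILE
    # Crop unit size: roughly floor(min(width, height) / 1.5), kept at 1 or more.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    # Rounding up is an assumption that matches the 960x540 example (3 * 2 tiles).
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(960, 540))  # prints: 1548 (6 tiles * 258 tokens)
```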

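Media resolution

Gemini 3 introduces granular control over multimodal vision processing with the media_resolution parameter. The media_resolution parameter determines the maximum number of tokens allocated per input image or video frame. Higher resolutions improve the model's ability to read fine text or identify small details, but they increase token usage and latency.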

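What's next

This guide shows how to upload image files and generate text output from image inputs. To learn more, see the following resources:

Files API: Learn how to upload and manage files for use with Gemini.
System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
Safety guidance: Generative AI models sometimes produce unexpected output, such as output that is inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such output.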
