Document understanding

The Gemini API supports PDF input, including long documents (up to 1,000 pages). Gemini models process PDFs with native vision, so they can understand both the text and the image content inside a document. With native PDF vision support, Gemini models can:

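Analyze diagrams, charts, and tables inside documents
Extract information into structured output formats
Answer questions about the visual and text content of documents
Summarize documents
Transcribe document content (for example, to HTML), preserving layout and formatting, for use in downstream applications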

This tutorial shows how to process PDF documents with the Gemini API.

PDF input

For PDF payloads under 20 MB, you can either upload the document as base64-encoded data or directly upload a locally stored file.
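To make the base64 option concrete, here is a minimal sketch of how raw PDF bytes could be wrapped as an inline part alongside a text prompt. The helper names are hypothetical, and the JSON field names (inline_data, mime_type, data, contents, parts) are an assumption about the REST request shape rather than something this page specifies. The Python SDK snippets on this page pass raw bytes to types.Part.from_bytes instead, so this manual encoding should only matter when you build requests by hand.

```python
import base64
import json


def build_inline_pdf_part(pdf_bytes: bytes) -> dict:
    """Wrap raw PDF bytes as a base64-encoded inline part (assumed field names)."""
    return {
        "inline_data": {
            "mime_type": "application/pdf",
            "data": base64.b64encode(pdf_bytes).decode("ascii"),
        }
    }


def build_request_body(pdf_bytes: bytes, prompt: str) -> str:
    """Serialize one PDF part followed by a text prompt as a JSON string."""
    body = {
        "contents": [
            {"parts": [build_inline_pdf_part(pdf_bytes), {"text": prompt}]}
        ]
    }
    return json.dumps(body)
```

Base64 encoding grows the data by roughly one third, which is worth keeping in mind when estimating the size of a request.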

As inline data

You can process PDF documents directly from a URL. The following code snippet shows how:

from google import genai
from google.genai import types
import httpx

client = genai.Client()

doc_url = "https://discovery.ucl.ac.uk/id/eprint/10089234/1/343019_3_art_0_py4t4l_convrt.pdf"

# Download the raw PDF bytes (Part.from_bytes takes them as-is)
doc_data = httpx.get(doc_url).content

prompt = "Summarize this document"
response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[
      types.Part.from_bytes(
        data=doc_data,
        mime_type='application/pdf',
      ),
      prompt])
print(response.text)

Technical details

Gemini 1.5 Pro and 1.5 Flash support a maximum of 3,600 document pages. Document pages must be in one of the following text data MIME types:

PDF - application/pdf
JavaScript - application/x-javascript, text/javascript
Python - application/x-python, text/x-python
TXT - text/plain
HTML - text/html
CSS - text/css
Markdown - text/md
CSV - text/csv
XML - text/xml
RTF - text/rtf

Each document page is equivalent to 258 tokens.
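As a quick worked example of that per-page figure, the sketch below multiplies a page count by 258. The function name is hypothetical, and the result is only a rough estimate for the document itself; it does not include the tokens of any accompanying prompt.

```python
TOKENS_PER_PAGE = 258  # per-page figure stated in this section


def estimate_document_tokens(page_count: int) -> int:
    """Return the approximate token count for a document of page_count pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * TOKENS_PER_PAGE


print(estimate_document_tokens(10))    # 2580
print(estimate_document_tokens(1000))  # 258000
```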

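Apart from the model's context window, there is no specific limit on the number of pixels in a document. Larger pages are scaled down to a maximum resolution of 3072x3072 while preserving their original aspect ratio, and smaller pages are scaled up to 768x768 pixels. Smaller pages do not reduce cost, other than bandwidth, and higher-resolution pages do not improve performance.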
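The scaling rule can be sketched as follows. This is only one reading of the paragraph above, not the service's actual implementation: it assumes that a page counts as "large" when its longest side exceeds 3072 pixels and as "small" when its longest side is below 768 pixels, and that aspect ratio is preserved in both directions. The function name and the rounding behavior are also assumptions.

```python
MAX_SIDE = 3072  # upper bound on the longest side, from the text above
MIN_SIDE = 768   # lower bound on the longest side (assumed interpretation)


def scaled_page_size(width: int, height: int) -> tuple[int, int]:
    """Estimate the page size after scaling, preserving aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest > MAX_SIDE:
        factor = MAX_SIDE / longest   # shrink large pages
    elif longest < MIN_SIDE:
        factor = MIN_SIDE / longest   # enlarge small pages
    else:
        return width, height          # already within bounds
    return max(1, round(width * factor)), max(1, round(height * factor))


print(scaled_page_size(6144, 3072))  # (3072, 1536)
print(scaled_page_size(384, 192))    # (768, 384)
print(scaled_page_size(1000, 1000))  # (1000, 1000)
```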

For best results:

Rotate pages to the correct orientation before uploading.
Avoid blurry pages.
If you use a single page, place the text prompt after the page.

Locally stored PDFs

For locally stored PDFs, you can use the following approach:

from google import genai
from google.genai import types
import pathlib
import httpx

client = genai.Client()

doc_url = "https://discovery.ucl.ac.uk/id/eprint/10089234/1/343019_3_art_0_py4t4l_convrt.pdf"

# Download the PDF and save it to a local file
filepath = pathlib.Path('file.pdf')
filepath.write_bytes(httpx.get(doc_url).content)

prompt = "Summarize this document"
response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[
      types.Part.from_bytes(
        data=filepath.read_bytes(),
        mime_type='application/pdf',
      ),
      prompt])
print(response.text)

Large PDFs

Use the File API to upload larger documents. Always use the File API when the total request size (including the files, text prompt, system instructions, and so on) is larger than 20 MB.
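The size rule above can be turned into a small pre-flight check. The sketch below is illustrative only: the function name is hypothetical, "20 MB" is read as 20 * 1024 * 1024 bytes (an assumption, since the text does not say which definition applies), and the total counts only the PDF bytes plus the UTF-8 size of the prompt and system instruction, ignoring any request overhead.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" read as 20 MiB


def should_use_file_api(pdf_size: int, prompt: str,
                        system_instruction: str = "") -> bool:
    """Return True when the estimated request size exceeds the inline limit."""
    total = (
        pdf_size
        + len(prompt.encode("utf-8"))
        + len(system_instruction.encode("utf-8"))
    )
    return total > INLINE_LIMIT_BYTES


print(should_use_file_api(5 * 1024 * 1024, "Summarize this document"))   # False
print(should_use_file_api(25 * 1024 * 1024, "Summarize this document"))  # True
```

If the payload is sent base64-encoded, a lower threshold may be safer, because encoding increases the size of the data.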

Call media.upload (client.files.upload in the Python SDK) to upload a file using the File API. The following code uploads a document file and then uses it in a call to models.generateContent.

Large PDFs from URLs

For large PDF files that are available from a URL, use the File API. This simplifies uploading and processing such documents directly from their URLs.

from google import genai
from google.genai import types
import io
import httpx

client = genai.Client()

long_context_pdf_path = "https://www.nasa.gov/wp-content/uploads/static/history/alsj/a17/A17_FlightPlan.pdf"

# Retrieve and upload the PDF using the File API
doc_io = io.BytesIO(httpx.get(long_context_pdf_path).content)

sample_doc = client.files.upload(
  # You can pass a path or a file-like object here
  file=doc_io,
  config=dict(
    mime_type='application/pdf')
)

prompt = "Summarize this document"

response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[sample_doc, prompt])
print(response.text)

Large PDFs stored locally

from google import genai
from google.genai import types
import pathlib
import httpx

client = genai.Client()

long_context_pdf_path = "https://www.nasa.gov/wp-content/uploads/static/history/alsj/a17/A17_FlightPlan.pdf"

# Retrieve the PDF
file_path = pathlib.Path('A17.pdf')
file_path.write_bytes(httpx.get(long_context_pdf_path).content)

# Upload the PDF using the File API
sample_file = client.files.upload(
  file=file_path,
)

prompt = "Summarize this document"

response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[sample_file, prompt])
print(response.text)

To verify that the API successfully stored the uploaded file, call files.get to retrieve its metadata. Only the name (and by extension, the uri) is unique.

from google import genai
import pathlib

client = genai.Client()

fpath = pathlib.Path('example.txt')
fpath.write_text('hello')

# upload() and get() take keyword arguments
file = client.files.upload(file='example.txt')

file_info = client.files.get(name=file.name)
print(file_info.model_dump_json(indent=4))

Multiple PDFs

The Gemini API can process multiple PDF documents in a single request, as long as the combined size of the documents and the text prompt stays within the model's context window.
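One way to check this before sending a request is to budget tokens across all documents, using the 258-tokens-per-page figure given under Technical details. In the sketch below, the function name is hypothetical, and both context_window and prompt_tokens are values you must supply yourself; the 1,000,000 used in the example is purely illustrative and is not a statement about any particular model.

```python
TOKENS_PER_PAGE = 258  # per-page figure from the Technical details section


def remaining_context(page_counts: list[int], prompt_tokens: int,
                      context_window: int) -> int:
    """Return the tokens left after all documents and the prompt.

    A negative result means the request would not fit.
    """
    document_tokens = sum(page_counts) * TOKENS_PER_PAGE
    return context_window - document_tokens - prompt_tokens


# Two documents of 60 and 150 pages plus a 50-token prompt.
print(remaining_context([60, 150], 50, 1_000_000))   # 945770
# Two very long documents that exceed the illustrative window.
print(remaining_context([3000, 2000], 0, 1_000_000)) # -290000
```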

from google import genai
import io
import httpx

client = genai.Client()

doc_url_1 = "https://arxiv.org/pdf/2312.11805"
doc_url_2 = "https://arxiv.org/pdf/2403.05530"

# Retrieve and upload both PDFs using the File API
doc_data_1 = io.BytesIO(httpx.get(doc_url_1).content)
doc_data_2 = io.BytesIO(httpx.get(doc_url_2).content)

sample_pdf_1 = client.files.upload(
  file=doc_data_1,
  config=dict(mime_type='application/pdf')
)
sample_pdf_2 = client.files.upload(
  file=doc_data_2,
  config=dict(mime_type='application/pdf')
)

prompt = "What is the difference between each of the main benchmarks between these two papers? Output these in a table."

response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[sample_pdf_1, sample_pdf_2, prompt])
print(response.text)

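What's next

To learn more, see the following resources:

File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.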