Document understanding

The Gemini API supports PDF input, including long documents of up to 1,000 pages. Gemini models process PDFs with native vision, so they can understand both the text and the image content in a document. With native PDF vision support, Gemini models can:

Analyze diagrams, charts, and tables inside documents
Extract information into structured output formats
Answer questions about visual and text content in documents
Summarize documents
Transcribe document content to HTML, preserving layout and formatting, for use in downstream applications

This tutorial demonstrates some possible ways to use the Gemini API to process PDF documents.

PDF input

For PDF payloads under 20 MB, you can either upload base64-encoded documents or directly upload locally stored files.

As inline data

You can process PDF documents directly from URLs. The following code snippet shows how:

from google import genai
from google.genai import types
import httpx

client = genai.Client()

doc_url = "https://discovery.ucl.ac.uk/id/eprint/10089234/1/343019_3_art_0_py4t4l_convrt.pdf"

# Retrieve the raw PDF bytes from the URL
doc_data = httpx.get(doc_url).content

prompt = "Summarize this document"
response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[
      types.Part.from_bytes(
        data=doc_data,
        mime_type='application/pdf',
      ),
      prompt])
print(response.text)

Technical details

Gemini 1.5 Pro and 1.5 Flash support a maximum of 3,600 document pages. Document pages must be in one of the following text data MIME types:

PDF - application/pdf
JavaScript - application/x-javascript, text/javascript
Python - application/x-python, text/x-python
TXT - text/plain
HTML - text/html
CSS - text/css
Markdown - text/md
CSV - text/csv
XML - text/xml
RTF - text/rtf
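
As a minimal sketch, the list above can be turned into a lookup that picks a MIME type from a file name. The MIME strings come from the list; the file extensions and the helper name `guess_document_mime_type` are assumptions made for illustration and are not part of the Gemini API.

```python
from pathlib import Path

# MIME types are taken from the list above.
# The extension-to-type mapping is an assumption for illustration.
MIME_BY_EXTENSION = {
    ".pdf": "application/pdf",
    ".js": "text/javascript",
    ".py": "text/x-python",
    ".txt": "text/plain",
    ".html": "text/html",
    ".css": "text/css",
    ".md": "text/md",
    ".csv": "text/csv",
    ".xml": "text/xml",
    ".rtf": "text/rtf",
}


def guess_document_mime_type(filename: str) -> str:
    """Return a listed MIME type for the file name (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()
    try:
        return MIME_BY_EXTENSION[suffix]
    except KeyError:
        # Anything outside the list above is rejected rather than guessed.
        raise ValueError(f"Unsupported document extension: {suffix!r}") from None
```

The returned string could then be passed as the `mime_type` argument in the examples in this guide.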

Each document page is equivalent to 258 tokens.
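
The per-page figure makes a rough token estimate straightforward. The sketch below only multiplies the page count by 258; the function name is hypothetical, and the estimate does not include the tokens of the text prompt.

```python
TOKENS_PER_PAGE = 258  # per-page figure stated above


def estimate_document_tokens(page_count: int) -> int:
    """Estimate the tokens consumed by a document's pages (hypothetical helper)."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * TOKENS_PER_PAGE


print(estimate_document_tokens(1000))  # 258000
```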

Gemini models don't impose any specific limit on the number of pixels in a document beyond the model's context window. Larger pages are scaled down to a maximum resolution of 3072x3072 while preserving their original aspect ratio, and smaller pages are scaled up to 768x768 pixels. Pages at lower sizes get no cost reduction other than bandwidth, and pages at higher resolution get no performance improvement.
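
The scaling rule can be approximated as follows. This is only a sketch under stated assumptions: the text does not spell out the exact algorithm, so the code assumes that the longer side is what gets fitted to the 3072 and 768 pixel bounds, and that pages between the two bounds are left unchanged. The function name is hypothetical.

```python
MAX_SIDE = 3072  # upper bound stated above
MIN_SIDE = 768   # lower bound stated above


def scaled_page_size(width: int, height: int) -> tuple[int, int]:
    """Approximate the page size after scaling (assumed behavior, not the API's)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest > MAX_SIDE:
        # Assumption: shrink so the longer side becomes 3072, keeping aspect ratio.
        factor = MAX_SIDE / longest
    elif longest < MIN_SIDE:
        # Assumption: enlarge so the longer side becomes 768, keeping aspect ratio.
        factor = MIN_SIDE / longest
    else:
        return width, height
    return round(width * factor), round(height * factor)
```

For example, under these assumptions a 6144x3072 page would come out as 3072x1536.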

For best results:

Rotate pages to the correct orientation before uploading.
Avoid blurry pages.
If using a single page, place the text prompt after the page.

Locally stored PDFs

For locally stored PDFs, use the following approach:

from google import genai
from google.genai import types
import pathlib
import httpx

client = genai.Client()

doc_url = "https://discovery.ucl.ac.uk/id/eprint/10089234/1/343019_3_art_0_py4t4l_convrt.pdf"

# Download the PDF and save it to a local file
filepath = pathlib.Path('file.pdf')
filepath.write_bytes(httpx.get(doc_url).content)

prompt = "Summarize this document"
response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[
      types.Part.from_bytes(
        data=filepath.read_bytes(),
        mime_type='application/pdf',
      ),
      prompt])
print(response.text)

Large PDFs

You can use the File API to upload large documents. Always use the File API when the total request size (including the files, text prompt, system instructions, and so on) is larger than 20 MB.
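
A simple size check can help decide between inline data and the File API. The sketch below is an approximation: it sums the raw byte lengths of the file, the prompt, and the system instructions, treats 20 MB as 20 * 1024 * 1024 bytes (an assumption), and ignores any request encoding overhead. The helper name `needs_file_api` is hypothetical.

```python
# Assumption: "20 MB" is interpreted as 20 * 1024 * 1024 bytes.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def needs_file_api(pdf_bytes: bytes, prompt: str, system_instruction: str = "") -> bool:
    """Return True if the approximate request size exceeds the inline limit."""
    total = (
        len(pdf_bytes)
        + len(prompt.encode("utf-8"))
        + len(system_instruction.encode("utf-8"))
    )
    return total > INLINE_LIMIT_BYTES
```

Because encoding overhead is not counted, it may be safer to switch to the File API somewhat below the limit.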

Call media.upload to upload a file using the File API. The following code uploads a document file and then uses the file in a call to models.generateContent.

Large PDFs from URLs

Use the File API for large PDF files available from URLs. This simplifies uploading and processing these documents directly from their URLs:

from google import genai
from google.genai import types
import io
import httpx

client = genai.Client()

long_context_pdf_path = "https://www.nasa.gov/wp-content/uploads/static/history/alsj/a17/A17_FlightPlan.pdf"

# Retrieve and upload the PDF using the File API
doc_io = io.BytesIO(httpx.get(long_context_pdf_path).content)

sample_doc = client.files.upload(
  # You can pass a path or a file-like object here
  file=doc_io,
  config=dict(
    mime_type='application/pdf')
)

prompt = "Summarize this document"

response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[sample_doc, prompt])
print(response.text)

Large PDFs stored locally

from google import genai
from google.genai import types
import pathlib
import httpx

client = genai.Client()

long_context_pdf_path = "https://www.nasa.gov/wp-content/uploads/static/history/alsj/a17/A17_FlightPlan.pdf"

# Retrieve the PDF
file_path = pathlib.Path('A17.pdf')
file_path.write_bytes(httpx.get(long_context_pdf_path).content)

# Upload the PDF using the File API
sample_file = client.files.upload(
  file=file_path,
)

prompt = "Summarize this document"

response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[sample_file, prompt])
print(response.text)

You can verify that the API successfully stored the uploaded file, and get its metadata, by calling files.get. Only the name (and by extension, the uri) are unique.

from google import genai
import pathlib

client = genai.Client()

fpath = pathlib.Path('example.txt')
fpath.write_text('hello')

# upload() and get() take keyword arguments
file = client.files.upload(file='example.txt')

file_info = client.files.get(name=file.name)
print(file_info.model_dump_json(indent=4))

Multiple PDFs

The Gemini API can process multiple PDF documents in a single request, as long as the combined size of the documents and the text prompt stays within the model's context window.

from google import genai
import io
import httpx

client = genai.Client()

doc_url_1 = "https://arxiv.org/pdf/2312.11805"
doc_url_2 = "https://arxiv.org/pdf/2403.05530"

# Retrieve and upload both PDFs using the File API
doc_data_1 = io.BytesIO(httpx.get(doc_url_1).content)
doc_data_2 = io.BytesIO(httpx.get(doc_url_2).content)

sample_pdf_1 = client.files.upload(
  file=doc_data_1,
  config=dict(mime_type='application/pdf')
)
sample_pdf_2 = client.files.upload(
  file=doc_data_2,
  config=dict(mime_type='application/pdf')
)

prompt = "What is the difference between each of the main benchmarks between these two papers? Output these in a table."

response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[sample_pdf_1, sample_pdf_2, prompt])
print(response.text)
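
Before sending several PDFs together, it can be useful to check whether they are likely to fit in the context window. The sketch below relies on the 258-tokens-per-page figure from the technical details. The page counts, the prompt token count, and the context window size are values the caller must supply, and the function name is hypothetical.

```python
TOKENS_PER_PAGE = 258  # per-page figure from the technical details


def fits_context_window(
    page_counts: list[int],
    prompt_tokens: int,
    context_window_tokens: int,
) -> bool:
    """Roughly check whether several documents plus a prompt fit in the window."""
    document_tokens = sum(pages * TOKENS_PER_PAGE for pages in page_counts)
    return document_tokens + prompt_tokens <= context_window_tokens


# Hypothetical numbers: two documents of 60 and 90 pages and a 50-token prompt.
print(fits_context_window([60, 90], 50, 100_000))  # True
```

This is an estimate only; the context window size for a given model should be taken from its documentation.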

What's next

To learn more, see the following resources:

File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.