The Gemini API can process and run inference on PDF documents passed to it. When a PDF is uploaded, the Gemini API can:
- Describe or answer questions about the content
- Summarize the content
- Extrapolate from the content
This tutorial demonstrates some possible ways to prompt the Gemini API with provided PDF documents. All output is text-only.
Before you begin: Set up your project and API key
Before calling the Gemini API, you need to set up your project and configure your API key.
Get and secure your API key
You need an API key to call the Gemini API. If you don't already have one, create a key in Google AI Studio.
It's strongly recommended that you do not check an API key into your version control system.
You should store your API key in a secrets store such as Google Cloud Secret Manager.
This tutorial assumes that you're accessing your API key as an environment variable.
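For example, a small helper can read the key from the environment and fail fast when it's missing. This is a sketch only; `get_api_key` is a hypothetical convenience function, not part of the SDK:

```python
import os

def get_api_key(var_name="API_KEY"):
    """Read the API key from the environment, failing fast if it's unset."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

Failing early with a clear message is easier to debug than an authentication error deep inside an API call.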
Install the SDK package and configure your API key
The Python SDK for the Gemini API is contained in the
google-generativeai
package.
Install the dependency using pip:
pip install -U google-generativeai
Import the package and configure the service with your API key:
import os
import google.generativeai as genai

genai.configure(api_key=os.environ['API_KEY'])
Technical details
Gemini 1.5 Pro and 1.5 Flash support a maximum of 3,600 document pages. Document pages must be in the application/pdf MIME type.
Each document page is equivalent to 258 tokens.
There are no specific limits on the number of pixels in a document beyond the model's context window. Larger pages are scaled down to a maximum resolution of 3072x3072, preserving their original aspect ratio, while smaller pages are scaled up to 768x768 pixels. Lower-resolution pages do not reduce cost (other than bandwidth), and higher-resolution pages do not improve performance.
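Because each page costs a fixed 258 tokens, you can estimate a document's token footprint before uploading it. A minimal sketch, using the per-page cost and page limit stated above:

```python
TOKENS_PER_PAGE = 258  # fixed per-page token cost
MAX_PAGES = 3600       # maximum pages per document for Gemini 1.5 models

def estimate_document_tokens(page_count):
    """Estimate a PDF's token cost and check it fits the page limit."""
    if page_count > MAX_PAGES:
        raise ValueError(f"{page_count} pages exceeds the {MAX_PAGES}-page limit.")
    return page_count * TOKENS_PER_PAGE

# A 100-page PDF consumes about 25,800 tokens of context.
print(estimate_document_tokens(100))  # → 25800
```

This kind of check is useful when deciding whether several documents will fit together in one prompt within the model's context window.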
For best results:
- Rotate pages to the correct orientation before uploading.
- Avoid blurry pages.
- If using a single page, place the text prompt after the page.
Upload a document using the File API
Use the File API to upload a document of any size. (Always use the File API when the combination of files and system instructions that you want to send is larger than 20 MB.)
Start by downloading this paper on Gemini 1.5:
!curl -o gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf
Upload the document using media.upload and print the URI, which is used as a reference in Gemini API calls:
# Upload the file and print a confirmation
sample_file = genai.upload_file(path="gemini.pdf",
display_name="Gemini 1.5 PDF")
print(f"Uploaded file '{sample_file.display_name}' as: {sample_file.uri}")
Verify PDF file upload and get metadata
You can verify that the API successfully stored the uploaded file and get its metadata by calling files.get through the SDK. Only the name (and by extension, the uri) is unique. Use display_name to identify files only if you manage uniqueness yourself.
file = genai.get_file(name=sample_file.name)
print(f"Retrieved file '{file.display_name}' as: {file.uri}")
Depending on your use case, you can store the URIs in structures such as a dict or a database.
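For instance, a plain dict keyed by display name works as a lightweight registry. This is a sketch; `register_file` is a hypothetical helper, and it assumes you keep display names unique yourself:

```python
def register_file(registry, uploaded_file):
    """Record an uploaded file's URI under its display name.

    Assumes display names are unique; raises if one is reused.
    """
    name = uploaded_file.display_name
    if name in registry:
        raise ValueError(f"Display name {name!r} is already registered.")
    registry[name] = uploaded_file.uri
    return registry

# Usage with a file returned by genai.upload_file:
# registry = {}
# register_file(registry, sample_file)
# uri = registry["Gemini 1.5 PDF"]
```

For anything beyond a single session, persist the mapping in a database instead, since in-memory structures are lost when the process exits.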
Prompt the Gemini API with the uploaded documents
After uploading the file, you can make GenerateContent requests that reference the File API URI. Select the generative model and provide it with a text prompt and the uploaded document:
# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")
# Prompt the model with text and the previously uploaded document.
response = model.generate_content([sample_file, "Can you summarize this document as a bulleted list?"])
print(response.text)
Upload one or more locally stored files
Alternatively, you can upload one or more locally stored files.
When the combination of files and system instructions that you intend to send is larger than 20 MB, use the File API to upload those files, as previously shown. For smaller files, you can instead extract the text locally and pass it to the Gemini API as part of the prompt:
import PyPDF2

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        extracted_text = ""
        for page in pdf_reader.pages:
            text = page.extract_text()
            if text:
                extracted_text += text
        return extracted_text
sample_file_2 = extract_text_from_pdf('example-1.pdf')
sample_file_3 = extract_text_from_pdf('example-2.pdf')
Prompt with multiple documents
You can provide the Gemini API with any combination of documents and text that fit within the model's context window. This example provides one short text prompt and three documents previously uploaded:
# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")
prompt = "Summarize the differences between the thesis statements for these documents."
response = model.generate_content([prompt, sample_file, sample_file_2, sample_file_3])
print(response.text)
List files
You can list all files uploaded using the File API, along with their URIs, using genai.list_files():
# List all files
for file in genai.list_files():
print(f"{file.display_name}, URI: {file.uri}")
Delete files
Files uploaded using the File API are automatically deleted after 2 days. You can also manually delete them using genai.delete_file():
# Delete file
genai.delete_file(sample_file.name)
print(f'Deleted file {sample_file.uri}')
What's next
This guide shows how to use generateContent to generate text outputs from processed documents. To learn more, see the following resources:
- Prompting with media files: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
- System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
- Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.