The Gemini API supports PDF input, including long documents (up to 3600 pages). Gemini models process PDFs with native vision, and are therefore able to understand both text and image contents inside documents. With native PDF vision support, Gemini models are able to:
- Analyze diagrams, charts, and tables inside documents.
- Extract information into structured output formats.
- Answer questions about visual and text contents in documents.
- Summarize documents.
- Transcribe document content (e.g. to HTML) preserving layouts and formatting, for use in downstream applications (such as in RAG pipelines).
This tutorial demonstrates some possible ways to use the Gemini API with PDF documents. All output is text-only.
What's next
This guide shows how to use
generateContent
and
to generate text outputs from processed documents. To learn more,
see the following resources:
- File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
- System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
- Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.