Explore vision capabilities with the Gemini API

The Gemini API can process images and videos, enabling a wide range of developer use cases. Gemini's vision capabilities include the ability to:

  • Caption and answer questions about images
  • Transcribe and reason over PDFs, including long documents, up to the 2-million-token context window
  • Describe, segment, and extract information from videos, including both visual frames and audio, up to 90 minutes long
  • Detect objects in an image and return bounding box coordinates for them

This tutorial demonstrates ways to prompt the Gemini API with image and video input, provides code examples, and outlines prompting best practices for multimodal vision capabilities. All model output is text-only.
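As a minimal illustration of image prompting, the sketch below builds the JSON request body that the Gemini REST API's `generateContent` endpoint expects for a combined text-and-image prompt, with the image inlined as base64. The image path and question are placeholders, and sending the request (with a model name and API key) is left out.

```python
import base64


def build_image_prompt(image_path: str, question: str) -> dict:
    """Build a generateContent request body pairing a text part with an
    inline base64-encoded image part (placeholder path and question)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "contents": [{
            "parts": [
                {"text": question},
                {"inline_data": {"mime_type": "image/jpeg", "data": image_b64}},
            ]
        }]
    }
```

The same `parts` list can mix several images with text, which is how multi-image prompts are expressed.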

What's next

This guide shows how to upload image and video files using the File API and then generate text outputs from image and video inputs. To learn more, see the following resources:
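The upload-then-prompt flow can be sketched as follows. The helper below builds the request body that references a previously uploaded file by its File API URI rather than inlining its bytes; the `file_uri` value shown is a hypothetical placeholder in the form the File API returns, and the upload step itself is only indicated in a comment.

```python
import json


def build_file_prompt(file_uri: str, mime_type: str, prompt: str) -> dict:
    """Build a generateContent request body that references a file
    previously uploaded through the File API, by URI instead of
    inline bytes -- useful for videos and other large media."""
    return {
        "contents": [{
            "parts": [
                {"file_data": {"mime_type": mime_type, "file_uri": file_uri}},
                {"text": prompt},
            ]
        }]
    }


# In a real call you would first upload the file through the File API and
# use the URI it returns; the value below is a hypothetical example.
body = build_file_prompt(
    "https://generativelanguage.googleapis.com/v1beta/files/abc-123",
    "video/mp4",
    "Describe what happens in this video.",
)
print(json.dumps(body, indent=2))
```

Referencing files by URI keeps request bodies small and lets one upload be reused across multiple prompts.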

  • File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
  • System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
  • Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.
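To illustrate the system-instructions point above: a system instruction travels alongside `contents` in the request body, steering every turn independently of the user prompt. The sketch below assumes the v1beta REST field layout; the instruction wording is a hypothetical example.

```python
def build_request_with_system_instruction(instruction: str, user_prompt: str) -> dict:
    """Attach a system instruction to a generateContent request body.
    The instruction shapes the model's behavior for the whole request,
    separate from the user-supplied prompt."""
    return {
        "system_instruction": {"parts": [{"text": instruction}]},
        "contents": [{"parts": [{"text": user_prompt}]}],
    }


# Hypothetical usage: steer the model toward accessibility-style captions.
request = build_request_with_system_instruction(
    "You write concise alt text for screen-reader users.",
    "Describe the attached image.",
)
```

Keeping behavioral guidance in the system instruction, rather than mixed into the prompt, makes it easier to vary user input without restating the desired style each time.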