Explore vision capabilities with the Gemini API

The Gemini API can run inference on image and video input. When passed an image, a series of images, or a video, Gemini can:

  • Describe or answer questions about the content
  • Summarize the content
  • Extrapolate from the content

This tutorial demonstrates some possible ways to prompt the Gemini API with image and video input. All output is text-only.
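As a starting point, the following is a minimal sketch of an image prompt, assuming the google-genai Python SDK (`pip install google-genai`). The model name, API key placeholder, and local file path are illustrative assumptions rather than values specified by this guide.

```python
# Minimal sketch, assuming the google-genai Python SDK.
# Model name, API key, and file path are placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder API key

# Read a local image and attach it to the prompt as inline bytes.
with open("photo.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed vision-capable model name
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Describe what is shown in this image.",
    ],
)

print(response.text)  # the response to an image prompt is text
```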

What's next

This guide shows how to use generateContent to generate text outputs from image and video inputs; a brief video-input sketch follows the resource list below. To learn more, see the following resources:

  • Prompting with media files: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
  • System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
  • Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.
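The same pattern extends to video, as sketched below under the same assumptions (google-genai Python SDK, placeholder model name, API key, and file path). A short clip can be sent inline as bytes; larger video files are typically uploaded separately before prompting.

```python
# Minimal video-prompting sketch with the same assumed SDK and placeholders.
# Short clips can be sent inline as bytes in the request.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder API key

# Read a short local clip and attach it to the prompt as inline bytes.
with open("clip.mp4", "rb") as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed model name
    contents=[
        types.Part.from_bytes(data=video_bytes, mime_type="video/mp4"),
        "Summarize this video and list the main events in order.",
    ],
)

print(response.text)
```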