The Gemini API can run inference on images and videos passed to it. When passed an image, a series of images, or a video, Gemini can:
- Describe or answer questions about the content
- Summarize the content
- Extrapolate from the content
This tutorial demonstrates some possible ways to prompt the Gemini API with image and video input. All model output is text-only.
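As a concrete illustration, the following is a minimal sketch of building a generateContent request body that pairs an image with a text question, using the REST API's inline_data part format (base64-encoded bytes plus a MIME type). The helper name build_image_prompt is hypothetical, and the image bytes here are placeholder data, not a real image.

```python
import base64


def build_image_prompt(image_bytes: bytes, mime_type: str, question: str) -> dict:
    """Hypothetical helper: assemble a generateContent request body that
    sends an image (as inline base64 data) together with a text prompt."""
    return {
        "contents": [
            {
                "parts": [
                    # Image part: raw bytes must be base64-encoded for JSON.
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(image_bytes).decode("utf-8"),
                        }
                    },
                    # Text part: the question about the image.
                    {"text": question},
                ]
            }
        ]
    }


# Placeholder bytes stand in for a real PNG file read from disk.
body = build_image_prompt(b"\x89PNG-placeholder", "image/png", "Describe this image")
```

The resulting dict can then be POSTed as JSON to the generateContent endpoint; for videos or larger files, uploading via the Files API and referencing the file URI is generally preferable to inlining bytes.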
What's next
This guide shows how to use generateContent to generate text outputs from image and video inputs. To learn more, see the following resources:
- Prompting with media files: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
- System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
- Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.
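To show how system instructions fit into a request, here is a minimal sketch of a generateContent request body that adds a system_instruction field to steer the model's behavior; the instruction text itself is an illustrative example, not part of the original guide.

```python
import json

# Sketch of a generateContent request body with a system instruction.
# The "system_instruction" field steers model behavior for the whole
# request; the example instruction below is purely illustrative.
request_body = {
    "system_instruction": {
        "parts": [{"text": "You are a concise alt-text writer. Answer in one sentence."}]
    },
    "contents": [
        {"parts": [{"text": "Describe the attached image for a screen reader."}]}
    ],
}

# Serialize to the JSON payload that would be POSTed to the endpoint.
payload = json.dumps(request_body, indent=2)
```

The same body can carry image or video parts alongside the text part, so one request combines a system instruction with multimodal content.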