In a typical AI workflow, you might pass the same input tokens over and over to a model. Using the Gemini API context caching feature, you can pass some content to the model once, cache the input tokens, and then refer to the cached tokens for subsequent requests. At certain volumes, using cached tokens is lower cost than passing in the same corpus of tokens repeatedly.
When you cache a set of tokens, you can choose how long you want the cache to exist before the tokens are automatically deleted. This caching duration is called the time to live (TTL). If not set, the TTL defaults to 1 hour. The cost for caching depends on the input token size and how long you want the tokens to persist.
Context caching is available for both Gemini 1.5 Pro and Gemini 1.5 Flash.
When to use context caching
Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by shorter requests. Consider using context caching for use cases such as:
- Chatbots with extensive system instructions
- Repetitive analysis of lengthy video files
- Recurring queries against large document sets (see the sketch after this list)
- Frequent code repository analysis or bug fixing
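As an illustration of the document-set use case above, the following sketch caches one large text corpus once and then reuses the cached tokens for several shorter questions. It assumes the same SDK and API key setup used in the examples later in this guide; the file name reference_docs.txt, the questions, and the 10-minute TTL are placeholders, not requirements.
import { readFileSync } from 'node:fs';
import { GoogleGenerativeAI } from '@google/generative-ai';
import { GoogleAICacheManager } from '@google/generative-ai/server';

// Read a large local corpus (placeholder path). It must meet the minimum
// cacheable token count described later in this guide.
const corpus = readFileSync('reference_docs.txt', 'utf8');

// Cache the corpus once, with a short TTL for this illustration.
const cacheManager = new GoogleAICacheManager(process.env.API_KEY);
const cache = await cacheManager.create({
  model: 'models/gemini-1.5-flash-001',
  displayName: 'reference docs',
  systemInstruction:
    'Answer questions using only the cached reference documents.',
  contents: [{ role: 'user', parts: [{ text: corpus }] }],
  ttlSeconds: 600,
});

// Reuse the cached tokens for several shorter requests.
const genAI = new GoogleGenerativeAI(process.env.API_KEY);
const genModel = genAI.getGenerativeModelFromCachedContent(cache);
for (const question of ['What topics are covered?', 'Summarize section 2.']) {
  const result = await genModel.generateContent(question);
  console.log(result.response.text());
}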
How caching reduces costs
Context caching is a paid feature designed to reduce overall operational costs. Billing is based on the following factors:
- Cache token count: The number of input tokens cached, billed at a reduced rate when included in subsequent prompts.
- Storage duration: The amount of time cached tokens are stored (the TTL), billed based on the TTL and the number of cached tokens. There are no minimum or maximum bounds on the TTL.
- Other factors: Other charges apply, such as for non-cached input tokens and output tokens.
For up-to-date pricing details, refer to the Gemini API pricing page. To learn how to count tokens, see the Token guide.
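Because caching is billed by cached token count and storage duration, it can help to check how many tokens a piece of content uses before caching it. The following minimal sketch uses the SDK's countTokens() method for that check; the model name matches the one used in the example below, and largeText is a placeholder for the content you intend to cache.
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.API_KEY);
const model = genAI.getGenerativeModel({ model: 'models/gemini-1.5-flash-001' });

// `largeText` is a placeholder for the content you intend to cache.
const largeText = '...';
const { totalTokens } = await model.countTokens(largeText);
console.log(`Token count: ${totalTokens}`);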
How to use context caching
This section assumes that you've installed a Gemini SDK (or have curl installed) and that you've configured an API key, as shown in the quickstart.
Generate content using a cache
The following example shows how to generate content using a cached system instruction and video file.
import { GoogleGenerativeAI } from '@google/generative-ai';
import {
  FileState,
  GoogleAICacheManager,
  GoogleAIFileManager,
} from '@google/generative-ai/server';
// A helper function that uploads the video to be cached.
async function uploadMp4Video(filePath, displayName) {
  const fileManager = new GoogleAIFileManager(process.env.API_KEY);
  const fileResult = await fileManager.uploadFile(filePath, {
    displayName,
    mimeType: 'video/mp4',
  });
  const { name, uri } = fileResult.file;
  // Poll getFile() on a set interval (2 seconds here) to check file state.
  let file = await fileManager.getFile(name);
  while (file.state === FileState.PROCESSING) {
    console.log('Waiting for video to be processed.');
    // Sleep for 2 seconds
    await new Promise((resolve) => setTimeout(resolve, 2_000));
    file = await fileManager.getFile(name);
  }
  console.log(`Video processing complete: ${uri}`);
  return fileResult;
}
// Download video file
// curl -O https://storage.googleapis.com/generativeai-downloads/data/Sherlock_Jr_FullMovie.mp4
const pathToVideoFile = 'Sherlock_Jr_FullMovie.mp4';
// Upload the video.
const fileResult = await uploadMp4Video(pathToVideoFile, 'Sherlock Jr. video');
// Construct a GoogleAICacheManager using your API key.
const cacheManager = new GoogleAICacheManager(process.env.API_KEY);
// Create a cache with a 5 minute TTL.
const displayName = 'sherlock jr movie';
const model = 'models/gemini-1.5-flash-001';
const systemInstruction =
  'You are an expert video analyzer, and your job is to answer ' +
  "the user's query based on the video file you have access to.";
let ttlSeconds = 300;
const cache = await cacheManager.create({
  model,
  displayName,
  systemInstruction,
  contents: [
    {
      role: 'user',
      parts: [
        {
          fileData: {
            mimeType: fileResult.file.mimeType,
            fileUri: fileResult.file.uri,
          },
        },
      ],
    },
  ],
  ttlSeconds,
});
// Get your API key from https://aistudio.google.com/app/apikey
// Access your API key as an environment variable.
const genAI = new GoogleGenerativeAI(process.env.API_KEY);
// Construct a `GenerativeModel` which uses the cache object.
const genModel = genAI.getGenerativeModelFromCachedContent(cache);
// Query the model.
const result = await genModel.generateContent({
  contents: [
    {
      role: 'user',
      parts: [
        {
          text:
            'Introduce different characters in the movie by describing ' +
            'their personality, looks, and names. Also list the ' +
            'timestamps they were introduced for the first time.',
        },
      ],
    },
  ],
});
console.log(result.response.usageMetadata);
// The output should look something like this:
//
// {
// promptTokenCount: 696220,
// candidatesTokenCount: 270,
// totalTokenCount: 696490,
// cachedContentTokenCount: 696191
// }
console.log(result.response.text());
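As a rough sketch of how these numbers relate to billing, you can subtract the cached token count from the total prompt token count to see how many input tokens were charged at the standard (non-cached) rate. The field names below are the ones shown in the usageMetadata output above, and result is the response object from the previous example.
const usage = result.response.usageMetadata;
const nonCachedInputTokens = usage.promptTokenCount - usage.cachedContentTokenCount;
console.log(`Cached input tokens: ${usage.cachedContentTokenCount}`);
console.log(`Non-cached input tokens: ${nonCachedInputTokens}`);
console.log(`Output tokens: ${usage.candidatesTokenCount}`);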
List caches
It's not possible to retrieve or view cached content, but you can retrieve cache metadata (name, model, displayName, usageMetadata, createTime, updateTime, and expireTime).
To list metadata for all uploaded caches, use GoogleAICacheManager.list():
const listResult = await cacheManager.list();
listResult.cachedContents.forEach((cache) => {
  console.log(cache);
});
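If you already know a cache's name (for example, cache.name from the create() call shown earlier), you can fetch just that cache's metadata with the cache service's get operation, which is also mentioned in the considerations at the end of this guide. This is a minimal sketch of that call.
// `cache.name` comes from the create() call shown earlier.
const cachedContent = await cacheManager.get(cache.name);
console.log(cachedContent.displayName, cachedContent.expireTime);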
Update a cache
You can set a new ttl or expireTime for a cache. Changing anything else about the cache isn't supported.
The following example shows how to update the ttl of a cache using GoogleAICacheManager.update().
// `cacheName` is the cache's `name` field (for example, `cache.name` from the create() call above).
const ttlSeconds = 2 * 60 * 60; // 2 hours
const updateParams = { cachedContent: { ttlSeconds } };
const updatedCache = await cacheManager.update(cacheName, updateParams);
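One way to use this in practice is to extend a cache only when it is close to expiring. The following sketch reuses the ttlSeconds update shape shown above; it assumes expireTime on the cache metadata is an RFC 3339 timestamp string (as in the REST resource), which is an assumption rather than something shown earlier in this guide.
// Extend the cache's TTL by two hours if it expires within the next 10 minutes.
// Assumes `expireTime` is an RFC 3339 timestamp string, as in the REST resource.
const current = await cacheManager.get(cacheName);
const msUntilExpiry = new Date(current.expireTime).getTime() - Date.now();
if (msUntilExpiry < 10 * 60 * 1000) {
  await cacheManager.update(cacheName, { cachedContent: { ttlSeconds: 2 * 60 * 60 } });
}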
Delete a cache
The caching service provides a delete operation for manually removing content from the cache. The following example shows how to delete a cache using GoogleAICacheManager.delete().
await cacheManager.delete(cacheName);
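Caches are also deleted automatically once their TTL elapses, so manual deletion is mainly useful for explicit cleanup. For example, a small sketch that removes every cache in a project by combining list() and delete() might look like this.
// Remove all caches by listing them and deleting each one by name.
const { cachedContents } = await cacheManager.list();
for (const cachedContent of cachedContents ?? []) {
  await cacheManager.delete(cachedContent.name);
}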
Additional considerations
Keep the following considerations in mind when using context caching:
- The minimum input token count for context caching is 32,768, and the maximum is the same as the maximum for the given model. (For more on counting tokens, see the Token guide).
- The model doesn't make any distinction between cached tokens and regular input tokens. Cached content is simply a prefix to the prompt.
- There are no special rate or usage limits on context caching; the standard rate limits for GenerateContent apply, and token limits include cached tokens.
- The number of cached tokens is returned in the usage_metadata from the create, get, and list operations of the cache service, and also in GenerateContent when using the cache (as shown in the sketch after this list).
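To make the last point concrete, here is a minimal sketch of reading the cached token count in both places; it assumes the cache and result objects from the earlier example are in scope, and the totalTokenCount field on the cache's usageMetadata follows the REST resource shape (an assumption, not something shown above).
// Token count stored in the cache itself (reported by the cache service).
console.log(cache.usageMetadata.totalTokenCount);

// Cached tokens counted toward this request (reported by GenerateContent).
console.log(result.response.usageMetadata.cachedContentTokenCount);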