Context caching

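In a typical AI workflow, you might pass the same input tokens to a model over and over. With the Gemini API context caching feature, you can pass some content to the model once, cache the input tokens, and then refer to the cached tokens in subsequent requests. At certain volumes, using cached tokens costs less than repeatedly passing in the same corpus of tokens.

When you cache a set of tokens, you can choose how long the cache exists before the tokens are automatically deleted. This caching duration is called the time to live (TTL). If not set, the TTL defaults to 1 hour. The cost of caching depends on the input token size and how long you want the tokens to persist.

Context caching supports both Gemini 1.5 Pro and Gemini 1.5 Flash.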

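When to use context caching

Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by shorter requests. Consider using context caching for use cases such as:

Chatbots with extensive system instructions
Repetitive analysis of lengthy video files
Recurring queries against large document sets
Frequent code repository analysis or bug fixing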

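How caching reduces costs

Context caching is a paid feature designed to reduce overall operational costs. Billing is based on the following factors:

Cached token count: The number of input tokens cached, billed at a reduced rate when the same tokens are reused in subsequent prompts.
Storage duration: The amount of time cached tokens are stored (TTL), billed based on the cached token count and the TTL duration. There are no minimum or maximum bounds on the TTL.
Other factors: Other charges apply, such as for non-cached input tokens and output tokens.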
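
To make the trade-off between these factors concrete, the following sketch compares resending a corpus with every request against caching it once. All prices are hypothetical placeholders, not actual Gemini API rates, and the names Pricing, cost_without_cache, and cost_with_cache are illustrative. It also assumes that creating the cache is billed once at the regular input rate, and it ignores non-cached prompt tokens and output tokens, which are the same in both cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per 1M tokens (not real Gemini rates)."""
    input_per_million: float = 1.00          # regular input tokens
    cached_per_million: float = 0.25         # discounted cached input tokens
    storage_per_million_hour: float = 1.00   # cache storage, per hour


def cost_without_cache(corpus_tokens: int, requests: int, p: Pricing) -> float:
    """Every request resends the full corpus at the regular input rate."""
    return requests * corpus_tokens / 1e6 * p.input_per_million


def cost_with_cache(corpus_tokens: int, requests: int,
                    ttl_hours: float, p: Pricing) -> float:
    """The corpus is cached once, then read at the discounted rate."""
    millions = corpus_tokens / 1e6
    creation = millions * p.input_per_million  # assumption: billed as regular input
    reads = requests * millions * p.cached_per_million
    storage = millions * p.storage_per_million_hour * ttl_hours
    return creation + reads + storage


pricing = Pricing()
for n in (1, 2, 3, 10):
    plain = cost_without_cache(700_000, n, pricing)
    cached = cost_with_cache(700_000, n, ttl_hours=1, p=pricing)
    print(f'{n:>2} requests: uncached ${plain:.2f}, cached ${cached:.2f}')
```

With these made-up rates, caching is more expensive for one or two requests and cheaper from the third request onward, which illustrates why the savings only appear at certain volumes.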

For up-to-date pricing details, refer to the Gemini API pricing page. To learn how to count tokens, see the Token guide.

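How to use context caching

This section assumes that you've installed a Gemini SDK (or have curl installed) and that you've configured an API key, as shown in the quickstart.

Generate content using a cache

The following example shows how to generate content using a cached system instruction and video file.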

import os
import google.generativeai as genai
from google.generativeai import caching
import datetime
import time

# Get your API key from https://aistudio.google.com/app/apikey
# and access your API key as an environment variable.
# To authenticate from a Colab, see
# https://github.com/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb
genai.configure(api_key=os.environ['API_KEY'])

# Download video file
# curl -O https://storage.googleapis.com/generativeai-downloads/data/Sherlock_Jr_FullMovie.mp4

path_to_video_file = 'Sherlock_Jr_FullMovie.mp4'

# Upload the video using the Files API
video_file = genai.upload_file(path=path_to_video_file)

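# Wait for the file to finish processing
while video_file.state.name == 'PROCESSING':
  print('Waiting for video to be processed.')
  time.sleep(2)
  video_file = genai.get_file(video_file.name)

# Stop early if processing did not succeed
if video_file.state.name == 'FAILED':
  raise ValueError(f'Video processing failed: {video_file.name}')

print(f'Video processing complete: {video_file.uri}')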

# Create a cache with a 5 minute TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='sherlock jr movie', # used to identify the cache
    system_instruction=(
        'You are an expert video analyzer, and your job is to answer '
        'the user\'s query based on the video file you have access to.'
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=5),
)

# Construct a GenerativeModel which uses the created cache.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query the model
response = model.generate_content([(
    'Introduce different characters in the movie by describing '
    'their personality, looks, and names. Also list the timestamps '
    'they were introduced for the first time.')])

print(response.usage_metadata)

# The output should look something like this:
#
# prompt_token_count: 696219
# cached_content_token_count: 696190
# candidates_token_count: 214
# total_token_count: 696433

print(response.text)
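
The usage metadata in the sample output can be used to check how much of a prompt was served from the cache. A minimal sketch, assuming (as the sample numbers suggest) that prompt_token_count includes the cached tokens; the helper name cache_usage is hypothetical and not part of the SDK.

```python
def cache_usage(prompt_token_count: int,
                cached_content_token_count: int) -> tuple[int, float]:
    """Return (uncached prompt tokens, fraction of the prompt served from cache)."""
    uncached = prompt_token_count - cached_content_token_count
    fraction = cached_content_token_count / prompt_token_count
    return uncached, fraction


# Values copied from the sample output above.
uncached, fraction = cache_usage(696219, 696190)
print(f'{uncached} uncached prompt tokens, {fraction:.4%} served from cache')
```

Here only 29 prompt tokens (the question itself) fall outside the cache; the rest of the prompt is the cached video and system instruction.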

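List caches

It's not possible to retrieve or view cached content, but you can retrieve cache metadata (name, model, display_name, usage_metadata, create_time, update_time, and expire_time).

To list metadata for all uploaded caches, use CachedContent.list():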

for c in caching.CachedContent.list():
  print(c)

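Update a cache

You can set a new ttl or expire_time for a cache. Changing anything else about the cache isn't supported.

The following example shows how to update the ttl of a cache using CachedContent.update().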

import datetime

cache.update(ttl=datetime.timedelta(hours=2))
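
Setting an absolute expire_time instead of a relative ttl follows the same pattern. The sketch below computes a timezone-aware expiry at the end of the current UTC day. The helper name end_of_utc_day is hypothetical, and the commented-out call assumes both the cache object from the earlier example and that update() accepts a timezone-aware datetime for expire_time.

```python
import datetime


def end_of_utc_day(now: datetime.datetime) -> datetime.datetime:
    """Return the first midnight (UTC) after `now` (hypothetical helper)."""
    now_utc = now.astimezone(datetime.timezone.utc)
    next_day = now_utc.date() + datetime.timedelta(days=1)
    return datetime.datetime.combine(
        next_day, datetime.time.min, tzinfo=datetime.timezone.utc)


expire_at = end_of_utc_day(datetime.datetime.now(datetime.timezone.utc))
print(expire_at.isoformat())

# Assumes `cache` is the CachedContent created earlier.
# cache.update(expire_time=expire_at)
```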

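Delete a cache

The caching service provides a delete operation for manually removing content from the cache. The following example shows how to delete a cache using CachedContent.delete().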

cache.delete()

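Additional considerations

Keep the following considerations in mind when using context caching:

The minimum input token count for context caching is 32,768, and the maximum is the same as the maximum for the given model. For more on counting tokens, see the Token guide.
The model doesn't make any distinction between cached tokens and regular input tokens. Cached content is simply a prefix to the prompt.
There are no special rate or usage limits on context caching; the standard rate limits for GenerateContent apply, and token limits include cached tokens.
The number of cached tokens is returned in the usage_metadata from the create, get, and list operations of the cache service, and also in GenerateContent when using the cache.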
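
Given the token bounds listed above, a pre-flight check before creating a cache may be useful. A minimal sketch, assuming you already know the token count of the content and the input token limit of your model; the function name can_cache and the example limit of 1,000,000 are hypothetical.

```python
MIN_CACHE_TOKENS = 32_768  # minimum stated in this guide


def can_cache(token_count: int, model_input_limit: int) -> bool:
    """Return True if the content fits within the caching bounds.

    `model_input_limit` is the input token limit of the chosen model;
    look it up for your model rather than relying on the example below.
    """
    return MIN_CACHE_TOKENS <= token_count <= model_input_limit


# Example with a made-up limit of 1,000,000 input tokens.
print(can_cache(696_190, 1_000_000))  # True
print(can_cache(10_000, 1_000_000))   # False
```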