Gemini API 支持 PDF 输入,包括长篇幅文档(最多 3, 600 页)。Gemini 模型使用原生视觉功能处理 PDF,因此能够理解文档中的文本和图片内容。借助原生 PDF 视觉支持,Gemini 模型能够:
- 分析文档中的图表、图表和表格
- 将信息提取为结构化输出格式
- 回答有关文档中视觉内容和文本内容的问题
- 总结文档
- 转写文档内容(例如转写为 HTML),保留布局和格式,以便在下游应用中使用
本教程演示了使用 Gemini API 处理 PDF 文档的一些可能方法。
PDF 输入
对于小于 20MB 的 PDF 载荷,您可以选择上传 base64 编码的文档,也可以直接上传本地存储的文件。
作为 inline_data
您可以直接通过网址处理 PDF 文档。以下代码段展示了如何执行此操作:
DOC_URL="https://discovery.ucl.ac.uk/id/eprint/10089234/1/343019_3_art_0_py4t4l_convrt.pdf"
PROMPT="Summarize this document"
DISPLAY_NAME="base64_pdf"
# Download the PDF
wget -O "${DISPLAY_NAME}.pdf" "${DOC_URL}"
# Check for FreeBSD base64 and set flags accordingly
if [[ "$(base64 --version 2>&1)" = *"FreeBSD"* ]]; then
B64FLAGS="--input"
else
B64FLAGS="-w0"
fi
# Base64 encode the PDF
ENCODED_PDF=$(base64 $B64FLAGS "${DISPLAY_NAME}.pdf")
# Generate content using the base64 encoded PDF
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"inline_data": {"mime_type": "application/pdf", "data": "'"$ENCODED_PDF"'"}},
{"text": "'$PROMPT'"}
]
}]
}' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json
# Clean up the downloaded PDF
rm "${DISPLAY_NAME}.pdf"
技术详情
Gemini 1.5 Pro 和 1.5 Flash 最多支持 3,600 个文档页面。文档页面必须采用以下文本数据 MIME 类型之一:
- PDF -
application/pdf
- JavaScript -
application/x-javascript
、text/javascript
- Python -
application/x-python
、text/x-python
- TXT -
text/plain
- HTML -
text/html
- CSS -
text/css
- Markdown -
text/md
- CSV -
text/csv
- XML -
text/xml
- RTF -
text/rtf
每页文档相当于 258 个词元。
除了模型的上下文窗口之外,文档中的像素数量没有具体限制,但较大的页面会缩小到最大分辨率 3072x3072,同时保留其原始宽高比,较小的页面会放大到 768x768 像素。除了带宽外,较小尺寸的网页不会降低费用,较高分辨率的网页也不会提升性能。
为了达到最佳效果,请注意以下事项:
- 请先将页面旋转到正确的方向,然后再上传。
- 避免页面模糊不清。
- 如果使用单个页面,请将文本提示放在该页面后面。
大型 PDF 文件
您可以使用 File API 上传任何大小的文档。当请求总大小(包括文件、文本提示、系统说明等)超过 20 MB 时,请始终使用 File API。
调用 media.upload
以使用 File API 上传文件。以下代码会上传文档文件,然后在对 models.generateContent
的调用中使用该文件。
通过网址上传的大型 PDF 文件
将 File API 用于可通过网址获取的大型 PDF 文件,简化直接通过网址上传和处理这些文档的过程:
PDF_PATH="https://www.nasa.gov/wp-content/uploads/static/history/alsj/a17/A17_FlightPlan.pdf"
DISPLAY_NAME="A17_FlightPlan"
PROMPT="Summarize this document"
# Download the PDF from the provided URL
wget -O "${DISPLAY_NAME}.pdf" "${PDF_PATH}"
MIME_TYPE=$(file -b --mime-type "${DISPLAY_NAME}.pdf")
NUM_BYTES=$(wc -c < "${DISPLAY_NAME}.pdf")
echo "MIME_TYPE: ${MIME_TYPE}"
echo "NUM_BYTES: ${NUM_BYTES}"
tmp_header_file=upload-header.tmp
# Initial resumable request defining metadata.
# The upload url is in the response headers dump them to a file.
curl "${BASE_URL}/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
-D upload-header.tmp \
-H "X-Goog-Upload-Protocol: resumable" \
-H "X-Goog-Upload-Command: start" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${DISPLAY_NAME}'}}" 2> /dev/null
upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# Upload the actual bytes.
curl "${upload_url}" \
-H "Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Offset: 0" \
-H "X-Goog-Upload-Command: upload, finalize" \
--data-binary "@${DISPLAY_NAME}.pdf" 2> /dev/null > file_info.json
file_uri=$(jq ".file.uri" file_info.json)
echo "file_uri: ${file_uri}"
# Now generate content using that file
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"text": "'$PROMPT'"},
{"file_data":{"mime_type": "application/pdf", "file_uri": '$file_uri'}}]
}]
}' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json
# Clean up the downloaded PDF
rm "${DISPLAY_NAME}.pdf"
存储在本地的大型 PDF 文件
NUM_BYTES=$(wc -c < "${PDF_PATH}")
DISPLAY_NAME=TEXT
tmp_header_file=upload-header.tmp
# Initial resumable request defining metadata.
# The upload url is in the response headers dump them to a file.
curl "${BASE_URL}/upload/v1beta/files?key=${GEMINI_API_KEY}" \
-D upload-header.tmp \
-H "X-Goog-Upload-Protocol: resumable" \
-H "X-Goog-Upload-Command: start" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: application/pdf" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${DISPLAY_NAME}'}}" 2> /dev/null
upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# Upload the actual bytes.
curl "${upload_url}" \
-H "Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Offset: 0" \
-H "X-Goog-Upload-Command: upload, finalize" \
--data-binary "@${PDF_PATH}" 2> /dev/null > file_info.json
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
# Now generate content using that file
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"text": "Can you add a few more lines to this poem?"},
{"file_data":{"mime_type": "application/pdf", "file_uri": '$file_uri'}}]
}]
}' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json
您可以调用 files.get
来验证 API 是否已成功存储上传的文件,并获取其元数据。只有 name
(以及通过扩展,uri
)是唯一的。
name=$(jq ".file.name" file_info.json)
# Get the file of interest to check state
curl https://generativelanguage.googleapis.com/v1beta/files/$name > file_info.json
# Print some information about the file you got
name=$(jq ".file.name" file_info.json)
echo name=$name
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
多个 PDF 文件
Gemini API 能够在单个请求中处理多个 PDF 文档,前提是文档和文本提示的总大小在模型的上下文窗口内。
DOC_URL_1="https://arxiv.org/pdf/2312.11805"
DOC_URL_2="https://arxiv.org/pdf/2403.05530"
DISPLAY_NAME_1="Gemini_paper"
DISPLAY_NAME_2="Gemini_1.5_paper"
PROMPT="What is the difference between each of the main benchmarks between these two papers? Output these in a table."
# Function to download and upload a PDF
upload_pdf() {
local doc_url="$1"
local display_name="$2"
# Download the PDF
wget -O "${display_name}.pdf" "${doc_url}"
local MIME_TYPE=$(file -b --mime-type "${display_name}.pdf")
local NUM_BYTES=$(wc -c < "${display_name}.pdf")
echo "MIME_TYPE: ${MIME_TYPE}"
echo "NUM_BYTES: ${NUM_BYTES}"
local tmp_header_file=upload-header.tmp
# Initial resumable request
curl "${BASE_URL}/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
-D "${tmp_header_file}" \
-H "X-Goog-Upload-Protocol: resumable" \
-H "X-Goog-Upload-Command: start" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${display_name}'}}" 2> /dev/null
local upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# Upload the PDF
curl "${upload_url}" \
-H "Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Offset: 0" \
-H "X-Goog-Upload-Command: upload, finalize" \
--data-binary "@${display_name}.pdf" 2> /dev/null > "file_info_${display_name}.json"
local file_uri=$(jq ".file.uri" "file_info_${display_name}.json")
echo "file_uri for ${display_name}: ${file_uri}"
# Clean up the downloaded PDF
rm "${display_name}.pdf"
echo "${file_uri}"
}
# Upload the first PDF
file_uri_1=$(upload_pdf "${DOC_URL_1}" "${DISPLAY_NAME_1}")
# Upload the second PDF
file_uri_2=$(upload_pdf "${DOC_URL_2}" "${DISPLAY_NAME_2}")
# Now generate content using both files
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"contents": [{
"parts":[
{"file_data": {"mime_type": "application/pdf", "file_uri": '$file_uri_1'}},
{"file_data": {"mime_type": "application/pdf", "file_uri": '$file_uri_2'}},
{"text": "'$PROMPT'"}
]
}]
}' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json
后续步骤
如需了解详情,请参阅以下资源: