Flex inference

Note: This version of the page covers the new Interactions API, which is currently in Beta.
For stable production deployments, we recommend continuing to use the generateContent API.

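The Gemini Flex API is an inference tier that costs 50% less than the standard rate, in exchange for variable latency and best-effort availability. It is designed for latency-tolerant workloads that need synchronous processing but not the real-time performance of the standard API.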

How to use Flex

To use the Flex tier, set service_tier to flex in your request. If you omit this field, the request defaults to the standard tier.

Python

from google import genai

client = genai.Client()

try:
    interaction = client.interactions.create(
        model="gemini-3-flash-preview",
        input="Analyze this dataset for trends...",
        service_tier='flex'
    )
    print(interaction.steps[-1].content[0].text)
except Exception as e:
    print(f"Flex request failed: {e}")

JavaScript

import { GoogleGenAI } from '@google/genai';

const client = new GoogleGenAI({});

async function main() {
    try {
        const interaction = await client.interactions.create({
            model: 'gemini-3-flash-preview',
            input: 'Analyze this dataset for trends...',
            serviceTier: 'flex'
        });
        console.log(interaction.steps.at(-1).content[0].text);
    } catch (e) {
        console.log(`Flex request failed: ${e}`);
    }
}
await main();

REST

curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "Content-Type: application/json" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -d '{
      "model": "gemini-3-flash-preview",
      "input": "Analyze this dataset for trends...",
      "service_tier": "flex"
  }'

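How Flex inference works

Gemini Flex inference bridges the gap between the standard API and the Batch API's 24-hour turnaround. It uses off-peak, "sheddable" compute capacity to offer a cost-effective option for background tasks and sequential workflows.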

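Feature	Flex	Priority	Standard	Batch
Pricing	50% discount	75% to 100% more than Standard	Full price	50% discount
Latency	Minutes (target: 1 to 15 minutes)	Low (seconds)	Seconds to minutes	Up to 24 hours
Reliability	Best effort (sheddable)	High (not sheddable)	High / medium-high	High (throughput)
Interface	Synchronous	Synchronous	Synchronous	Asynchronous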

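Key benefits

Cost efficiency: substantial savings on non-production evaluations, background agents, and data enrichment.
Low friction: add a single parameter to your existing requests.
Synchronous workflows: well suited to sequential API chains in which the next request depends on the output of the previous one, which makes Flex a better fit than Batch for agentic workflows.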
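The sequential chaining pattern described in the last benefit can be sketched as follows. This is a minimal illustration rather than SDK code: `run_chain`, `call_model`, and `fake_model` are hypothetical names, and the stub stands in for a real synchronous Flex request so the example runs without network access.

```python
from typing import Callable, List


def run_chain(call_model: Callable[[str], str], first_input: str,
              follow_ups: List[str]) -> List[str]:
    """Run prompts in sequence, feeding each output into the next prompt.

    `call_model` is a hypothetical callable that sends one synchronous
    request (for example on the Flex tier) and returns the response text.
    """
    outputs = [call_model(first_input)]
    for template in follow_ups:
        # Each step blocks until the previous output is available, which is
        # why a synchronous tier fits this pattern better than Batch.
        outputs.append(call_model(template.format(previous=outputs[-1])))
    return outputs


def fake_model(prompt: str) -> str:
    """Stub that echoes the prompt; replace with a real API call."""
    return f"<{prompt}>"


results = run_chain(fake_model, "extract", ["summarize {previous}"])
print(results)
```

In a real chain, each step may wait minutes on Flex, so the total latency is roughly the sum of the per-step latencies.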

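Use cases

Offline evaluation: running "LLM-as-a-judge" regression tests or leaderboards.
Background agents: sequential work that can tolerate a delay of several minutes, such as CRM updates, profile building, or content moderation.
Budget-constrained research: academic experiments that need a large number of tokens on a limited budget.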

Rate limits

Flex inference traffic counts toward your regular rate limits. Unlike the Batch API, it does not come with extended rate limits.

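Sheddable capacity

Flex traffic has lower priority. If standard traffic spikes, the system may preempt or shed Flex requests so that higher-priority users have enough capacity. To learn about high-priority inference, see Priority inference.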

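Error codes

When Flex capacity is insufficient or the system is congested, the API returns standard error codes:

503 Service Unavailable: the system is currently at capacity.
429 Too Many Requests: rate limit reached or resources exhausted.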
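One way to act on these two codes is to treat them as transient and everything else as final. The helper below is a hypothetical sketch: `should_retry` and `RETRYABLE_STATUS_CODES` are illustrative names, and how you obtain the HTTP status code from a failed request depends on the client library you use.

```python
# Status codes this page lists for Flex capacity or congestion problems.
RETRYABLE_STATUS_CODES = {429, 503}


def should_retry(status_code: int) -> bool:
    """Return True when a failed Flex request is worth retrying.

    Assumption: only 429 and 503 are treated as transient here; other
    errors (for example, a malformed request) would fail again unchanged.
    """
    return status_code in RETRYABLE_STATUS_CODES


for code in (503, 429, 400):
    print(code, should_retry(code))
```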

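Client responsibilities

No server-side fallback: to avoid unexpected charges, the system does not automatically upgrade a Flex request to the standard tier when Flex capacity is full.
Retries: you must implement your own client-side retry logic with exponential backoff.
Timeouts: because Flex requests can wait in a queue, set client-side timeouts to 10 minutes or longer so that connections don't close prematurely.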
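Retries and long timeouts compound, so it can help to estimate how long a single logical call might block. The sketch below is only planning arithmetic under stated assumptions: `worst_case_seconds` is a hypothetical helper, every attempt is assumed to run to its full timeout, and the delay doubles after each failed attempt.

```python
def worst_case_seconds(max_attempts: int, base_delay: float,
                       request_timeout: float) -> float:
    """Upper bound on wall-clock time for one call with retries.

    Sleeps happen only between attempts, so there are max_attempts - 1
    of them, with delays base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    backoff = sum(base_delay * (2 ** i) for i in range(max_attempts - 1))
    return max_attempts * request_timeout + backoff


# Three attempts, 5-second base delay, 600-second timeout per attempt:
# 3 * 600 + (5 + 10) seconds.
print(worst_case_seconds(3, 5, 600))  # 1815
```

If this bound is too long for the calling workflow, lowering the attempt count is usually simpler than shortening the timeout below the expected queue time.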

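Adjusting timeouts

You can set a per-request timeout for both the REST API and the client libraries. Make sure the client timeout covers the expected server-side wait (for example, 600 seconds or more for the Flex queue). The SDKs expect timeout values in milliseconds.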
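Because the wait is described in seconds while the SDKs take milliseconds, a small conversion helper can prevent unit mistakes. This is a sketch: `seconds_to_sdk_timeout` is a hypothetical name, and the resulting dictionary simply mirrors the `timeout` option used in this page's examples.

```python
def seconds_to_sdk_timeout(seconds: float) -> int:
    """Convert a timeout in seconds to the milliseconds the SDKs expect."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return round(seconds * 1000)


# 15 minutes, comfortably above the suggested 10-minute minimum.
http_options = {"timeout": seconds_to_sdk_timeout(15 * 60)}
print(http_options)  # {'timeout': 900000}
```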

Per-request timeouts

Python

from google import genai

client = genai.Client()

try:
    interaction = client.interactions.create(
        model="gemini-3-flash-preview",
        input="why is the sky blue?",
        service_tier="flex",
        http_options={"timeout": 900000}
    )
except Exception as e:
    print(f"Flex request failed: {e}")

JavaScript

import { GoogleGenAI } from '@google/genai';

const client = new GoogleGenAI({});

async function main() {
    try {
        const interaction = await client.interactions.create({
            model: "gemini-3-flash-preview",
            input: "why is the sky blue?",
            serviceTier: "flex",
            httpOptions: {timeout: 900000}
        });
    } catch (e) {
        console.log(`Flex request failed: ${e}`);
    }
}

await main();

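Implementing retries

Because Flex is sheddable and can fail with 503 errors, the following examples show how you can optionally implement retry logic to keep processing failed requests: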

Python

import time
from google import genai

client = genai.Client()

def call_with_retry(max_retries=3, base_delay=5):
    for attempt in range(max_retries):
        try:
            return client.interactions.create(
                model="gemini-3-flash-preview",
                input="Analyze this batch statement.",
                service_tier="flex",
            )
        except Exception as e:
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt) # Exponential Backoff
                print(f"Flex busy, retrying in {delay}s...")
                time.sleep(delay)
            else:
                # Fallback to standard on last strike (Optional)
                print("Flex exhausted, falling back to Standard...")
                return client.interactions.create(
                    model="gemini-3-flash-preview",
                    input="Analyze this batch statement."
                )

# Usage
interaction = call_with_retry()
print(interaction.steps[-1].content[0].text)

JavaScript

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({});

async function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function callWithRetry(maxRetries = 3, baseDelay = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      console.log(`Attempt ${attempt + 1}: Calling Flex tier...`);
      const interaction = await ai.interactions.create({
        model: "gemini-3-flash-preview",
        input: "Analyze this batch statement.",
        serviceTier: 'flex',
      });
      return interaction;
    } catch (e) {
      if (attempt < maxRetries - 1) {
        const delay = baseDelay * (2 ** attempt);
        console.log(`Flex busy, retrying in ${delay}s...`);
        await sleep(delay * 1000);
      } else {
        console.log("Flex exhausted, falling back to Standard...");
        return await ai.interactions.create({
          model: "gemini-3-flash-preview",
          input: "Analyze this batch statement.",
        });
      }
    }
  }
}

async function main() {
    const interaction = await callWithRetry();
    console.log(interaction.steps.at(-1).content[0].text);
}

await main();

Pricing

Flex inference is priced at 50% of the standard API rate and is billed per token.
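As a worked example of the discount, the sketch below compares the cost of a job at the standard rate and on Flex. The per-million-token price is a made-up placeholder, not a real Gemini price, and `flex_cost` is a hypothetical helper. Real bills typically price input and output tokens separately, so you would apply it to each rate.

```python
FLEX_DISCOUNT = 0.5  # Flex is billed at 50% of the standard rate.


def flex_cost(standard_price_per_million: float, tokens: int) -> float:
    """Estimated Flex cost for `tokens`, given a standard per-million price."""
    standard_cost = standard_price_per_million * tokens / 1_000_000
    return standard_cost * (1 - FLEX_DISCOUNT)


# Placeholder price of 2.00 per million tokens, 5 million tokens processed:
# standard cost is 10.0, so the Flex cost is half of that.
print(flex_cost(2.00, 5_000_000))  # 5.0
```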

Supported models

The following models support Flex inference:

Model	Flex inference
Gemini 3.1 Flash-Lite	✔️
Gemini 3.1 Flash-Lite Preview	✔️
Gemini 3.1 Pro Preview	✔️
Gemini 3 Flash Preview	✔️
Gemini 2.5 Pro	✔️
Gemini 2.5 Flash	✔️
Gemini 2.5 Flash-Lite	✔️

Next steps

Priority inference: for ultra-low latency.
Tokens: learn about tokens.