Flex inference
The Gemini Flex API is an inference tier that offers a 50% cost reduction compared to standard rates, in exchange for variable latency and best-effort availability. It's designed for latency-tolerant workloads that require synchronous processing but don't need the real-time performance of the standard API.
How to use Flex
To use the Flex tier, set service_tier to flex in your request. If you omit this field, requests use the standard tier.
Python
from google import genai

client = genai.Client()

try:
    interaction = client.interactions.create(
        model="gemini-3-flash-preview",
        input="Analyze this dataset for trends...",
        service_tier="flex",  # Route this request to the discounted Flex tier
    )
    print(interaction.steps[-1].content[0].text)
except Exception as e:
    # Flex is best-effort: a 503 here means capacity is unavailable
    print(f"Flex request failed: {e}")
JavaScript
import { GoogleGenAI } from '@google/genai';

const client = new GoogleGenAI({});

async function main() {
  try {
    const interaction = await client.interactions.create({
      model: 'gemini-3-flash-preview',
      input: 'Analyze this dataset for trends...',
      serviceTier: 'flex'  // Route this request to the discounted Flex tier
    });
    console.log(interaction.steps.at(-1).content[0].text);
  } catch (e) {
    // Flex is best-effort: a 503 here means capacity is unavailable
    console.log(`Flex request failed: ${e}`);
  }
}

await main();
REST
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
-H "Content-Type: application/json" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-d '{
"model": "gemini-3-flash-preview",
"input": "Analyze this dataset for trends...",
"service_tier": "flex"
}'
How Flex inference works
Gemini Flex inference bridges the gap between the standard API and the 24-hour turnaround of the Batch API. It uses off-peak, "sheddable" compute capacity to provide a cost-effective option for background tasks and sequential workflows.
| Feature | Flex | Priority | Standard | Batch |
|---|---|---|---|---|
| Pricing | 50% discount | 75-100% more than Standard | Full price | 50% discount |
| Latency | Minutes (1–15 min target) | Low (Seconds) | Seconds to minutes | Up to 24 hours |
| Reliability | Best-effort (Sheddable) | High (Non-sheddable) | High / Medium-high | High (for throughput) |
| Interface | Synchronous | Synchronous | Synchronous | Asynchronous |
Key benefits
- Cost efficiency: Substantial savings for non-production evals, background agents, and data enrichment.
- Low friction: Simply add a single parameter to your existing requests.
- Synchronous workflows: Ideal for sequential API chains where each request depends on the output of the previous one, making Flex more flexible than Batch for agentic workflows (see the sketch after this list).
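As a minimal sketch of such a chain, assuming the interactions API shown above (the prompts and the summarize-then-classify flow are illustrative):

Python
from google import genai

client = genai.Client()

def flex_call(prompt):
    # Every step runs on the discounted Flex tier; each may take minutes.
    interaction = client.interactions.create(
        model="gemini-3-flash-preview",
        input=prompt,
        service_tier="flex",
    )
    return interaction.steps[-1].content[0].text

# Step 2 depends on step 1's output, a sequential pattern the
# asynchronous Batch API can't express directly.
summary = flex_call("Summarize this support ticket: ...")
category = flex_call(f"Classify this ticket summary into a CRM category: {summary}")
print(category)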
Use cases
- Offline evaluations: Running "LLM-as-a-judge" regression tests or leaderboards (see the sketch after this list).
- Background agents: Sequential tasks like CRM updates, profile building, or content moderation where minutes of delay are acceptable.
- Budget-constrained research: Academic experiments that require high token volume on a limited budget.
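For example, a minimal LLM-as-a-judge loop on Flex might look like the following sketch (the eval set, rubric prompt, and PASS/FAIL format are illustrative assumptions):

Python
from google import genai

client = genai.Client()

# Hypothetical eval set: (candidate answer, reference answer) pairs.
eval_set = [
    ("Paris is the capital of France.", "Paris"),
    ("The capital of France is Lyon.", "Paris"),
]

for candidate, reference in eval_set:
    # Each judgment is independent and latency-tolerant, a good fit for Flex.
    interaction = client.interactions.create(
        model="gemini-3-flash-preview",
        input=(
            f"Reference answer: {reference}\n"
            f"Candidate answer: {candidate}\n"
            "Does the candidate match the reference? Reply PASS or FAIL."
        ),
        service_tier="flex",
    )
    print(interaction.steps[-1].content[0].text)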
Rate limits
Flex inference traffic counts towards your project's standard rate limits; it doesn't offer extended rate limits like the Batch API.
Sheddable capacity
Flex traffic is treated as lower priority. If standard traffic spikes, Flex requests may be preempted to preserve capacity for higher-priority tiers. If you need high-priority inference, see Priority inference.
Error codes
When Flex capacity is unavailable or the system is congested, the API returns standard HTTP error codes, which your client can branch on (see the sketch after this list):
- 503 Service Unavailable: The system is currently at capacity.
- 429 Too Many Requests: Rate limits or resource exhaustion.
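As a sketch of branching on these codes, assuming the Python SDK surfaces them via errors.APIError with a numeric code attribute (verify the error types in your SDK version):

Python
from google import genai
from google.genai import errors

client = genai.Client()

try:
    interaction = client.interactions.create(
        model="gemini-3-flash-preview",
        input="Analyze this dataset for trends...",
        service_tier="flex",
    )
except errors.APIError as e:
    if e.code == 503:
        # Flex capacity is full: retry with backoff, or fall back to Standard.
        print("Flex at capacity; retry later.")
    elif e.code == 429:
        # Rate limit or resource exhaustion: slow down before retrying.
        print("Rate limited; back off.")
    else:
        raise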
Client responsibility
- No server-side fallback: To prevent unexpected charges, the system won't automatically upgrade a Flex request to the Standard tier if Flex capacity is full.
- Retries: You must implement your own client-side retry logic with exponential backoff (see Implement retries below).
- Timeouts: Because Flex requests may sit in a queue, we recommend increasing client-side timeouts to 10 minutes or more to avoid premature connection closure.
Adjust timeout windows
You can configure per-request timeouts for the REST API and client libraries. Make sure your client-side timeout is at least as long as the time you expect a Flex request to wait in the queue (for example, 600 seconds or more). The SDKs expect timeout values in milliseconds.
Per-request timeouts
Python
from google import genai

client = genai.Client()

try:
    interaction = client.interactions.create(
        model="gemini-3-flash-preview",
        input="why is the sky blue?",
        service_tier="flex",
        http_options={"timeout": 900000},  # 900,000 ms = 15 minutes
    )
except Exception as e:
    print(f"Flex request failed: {e}")
JavaScript
import { GoogleGenAI } from '@google/genai';

const client = new GoogleGenAI({});

async function main() {
  try {
    const interaction = await client.interactions.create({
      model: "gemini-3-flash-preview",
      input: "why is the sky blue?",
      serviceTier: "flex",
      httpOptions: {timeout: 900000}  // 900,000 ms = 15 minutes
    });
  } catch (e) {
    console.log(`Flex request failed: ${e}`);
  }
}

await main();
Implement retries
Because Flex is sheddable and can fail with 503 errors, the following examples implement retry logic with exponential backoff and an optional fallback to the Standard tier:
Python
import time
from google import genai

client = genai.Client()

def call_with_retry(max_retries=3, base_delay=5):
    for attempt in range(max_retries):
        try:
            return client.interactions.create(
                model="gemini-3-flash-preview",
                input="Analyze this batch statement.",
                service_tier="flex",
            )
        except Exception as e:
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)  # Exponential backoff: 5s, 10s, 20s...
                print(f"Flex busy, retrying in {delay}s...")
                time.sleep(delay)
            else:
                # Optional: fall back to Standard on the final attempt
                print("Flex exhausted, falling back to Standard...")
                return client.interactions.create(
                    model="gemini-3-flash-preview",
                    input="Analyze this batch statement.",
                )

# Usage
interaction = call_with_retry()
print(interaction.steps[-1].content[0].text)
JavaScript
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({});

async function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function callWithRetry(maxRetries = 3, baseDelay = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      console.log(`Attempt ${attempt + 1}: Calling Flex tier...`);
      const interaction = await ai.interactions.create({
        model: "gemini-3-flash-preview",
        input: "Analyze this batch statement.",
        serviceTier: "flex",
      });
      return interaction;
    } catch (e) {
      if (attempt < maxRetries - 1) {
        const delay = baseDelay * (2 ** attempt);  // Exponential backoff: 5s, 10s, 20s...
        console.log(`Flex busy, retrying in ${delay}s...`);
        await sleep(delay * 1000);
      } else {
        // Optional: fall back to Standard on the final attempt
        console.log("Flex exhausted, falling back to Standard...");
        return await ai.interactions.create({
          model: "gemini-3-flash-preview",
          input: "Analyze this batch statement.",
        });
      }
    }
  }
}

async function main() {
  const interaction = await callWithRetry();
  console.log(interaction.steps.at(-1).content[0].text);
}

await main();
Pricing
Flex inference is billed per token at 50% of standard API rates.
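As a back-of-the-envelope sketch of what the discount means, the rates below are hypothetical placeholders, not real prices; only the 50% multiplier comes from this page:

Python
# Hypothetical standard rates in USD per 1M tokens (placeholders, not real pricing).
STANDARD_INPUT_RATE = 0.10
STANDARD_OUTPUT_RATE = 0.40
FLEX_DISCOUNT = 0.5  # Flex bills at 50% of standard rates

def flex_cost(input_tokens: int, output_tokens: int) -> float:
    standard = (input_tokens / 1e6) * STANDARD_INPUT_RATE
    standard += (output_tokens / 1e6) * STANDARD_OUTPUT_RATE
    return standard * FLEX_DISCOUNT

# 10M input + 2M output tokens under the placeholder rates above:
print(f"${flex_cost(10_000_000, 2_000_000):.2f}")  # $0.90 instead of $1.80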
Supported models
The following models support Flex inference:
| Model | Flex inference |
|---|---|
| Gemini 3.1 Flash-Lite | ✔️ |
| Gemini 3.1 Flash-Lite Preview | ✔️ |
| Gemini 3.1 Pro Preview | ✔️ |
| Gemini 3 Flash Preview | ✔️ |
| Gemini 2.5 Pro | ✔️ |
| Gemini 2.5 Flash | ✔️ |
| Gemini 2.5 Flash-Lite | ✔️ |
What's next
- Priority inference for ultra-low latency.
- Tokens: Understand how tokens are counted and billed.