The Interactions API is now generally available. We recommend using it to access all the latest features and models.

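Computer Use

The Computer Use tool lets you build browser, mobile, and desktop control agents that interact with and automate tasks. Using screenshots, the model can "see" the computer screen and "act" by generating specific UI actions such as mouse clicks and keyboard input. As with function calling, you need to implement a client-side execution environment that receives and executes the Computer Use actions.

Gemini 3.5 Flash is the recommended Computer Use model and introduces several new capabilities:

- Multiple environments: build agents for browser, mobile, and desktop environments.
- Streamlined actions with intent: each action includes an intent field that explains the model's reasoning behind the step.
- Configurable safety policies: fine-tune safety behavior with built-in policy categories and overrides.
- Prompt injection detection: opt in to screenshot scanning that detects hidden adversarial instructions.

With Computer Use, you can build agents that:

- Automate repetitive data entry or form filling on websites.
- Automate testing of web applications and user flows.
- Conduct research across websites (for example, gathering product information, prices, and reviews from e-commerce sites to inform a purchase decision).

Here's a short example that enables the Computer Use tool: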

Python

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Search for 'Gemini API' on Google.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
            )
        )]
    )
)

print(response.text)

JavaScript

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

const response = await ai.models.generateContent({
  model: 'gemini-3.5-flash',
  contents: "Search for 'Gemini API' on Google.",
  config: {
    tools: [{
      computerUse: {
        environment: "ENVIRONMENT_BROWSER",
      }
    }]
  }
});

console.log(response.text);

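How the Computer Use model works

To build an agent with the Computer Use model, you set up a continuous loop between your application and the API. Your code does the following at each step:

1. Send a request to the model
- Your application sends an API request containing the Computer Use tool, your configuration (such as the target environment), the user's prompt, and a screenshot of the current screen.
2. Receive the model response
- The model analyzes the screen and the prompt, then returns a response with a suggested function_call that represents a UI action (such as a click, scroll, or key press).
- For Gemini 3.5 Flash, the response also includes a reasoning intent that explains why the model chose the action.
- For legacy models (such as gemini-2.5-computer-use-preview-10-2025), the response may include a safety_decision from an internal safety system that classifies the action as regular/allowed, require_confirmation (user approval needed), or blocked.
3. Execute the received action
- If the action is allowed (or the user confirms it), your client code parses the function_call, scales the normalized coordinates to your viewport, and executes the action in the target environment with an automation tool such as Playwright. If the action is blocked, the client should stop execution or handle the interruption.
4. Capture the new environment state
- After the action runs, your application captures a new screenshot and sends it back to the model in a function_result to request the next action.

The process then repeats from step 2, asking the model for the next action until the task completes or is terminated.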
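
The four steps above can be sketched as a plain control loop. This is a minimal illustration with the model call and the environment injected as callables; `run_agent_loop`, `ask_model`, `execute_action`, and `capture_screenshot` are hypothetical names, not part of the SDK, and the dictionaries are simplified stand-ins for the real request and response types.

```python
from typing import Callable, Optional


def run_agent_loop(
    ask_model: Callable[[list], Optional[dict]],
    execute_action: Callable[[dict], None],
    capture_screenshot: Callable[[], bytes],
    prompt: str,
    max_turns: int = 5,
) -> list:
    """Drive the request -> action -> screenshot cycle until the model stops.

    ask_model receives the conversation history and returns the next action
    as a dict, or None when the task is finished (a hypothetical contract).
    """
    # Step 1: the first request carries the prompt and the current screen.
    history = [{"role": "user", "text": prompt, "screenshot": capture_screenshot()}]
    for _ in range(max_turns):
        # Step 2: receive the suggested action.
        action = ask_model(history)
        if action is None:
            break
        history.append({"role": "model", "action": action})
        # Step 3: execute it in the target environment.
        execute_action(action)
        # Step 4: send the new state back as the result of that action.
        history.append(
            {"role": "user", "result_for": action["name"], "screenshot": capture_screenshot()}
        )
    return history
```

Keeping the model and the environment behind callables makes it possible to exercise the loop with scripted actions before wiring in a real browser.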

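Computer Use overview

How to implement Computer Use

Before building with the Computer Use tool, set up the following:

- Secure execution environment: run the agent in a sandboxed virtual machine or container to isolate it from the host system and limit its potential impact. The reference implementation includes a ready-to-use Docker-based sandbox that you can use as a starting point.
- Client-side action handler: implement client-side logic that acts on coordinates, types text, and takes screenshots.

The following examples use a web browser as the execution environment and Playwright as the client-side handler.

0. Set up Playwright

First, install the required packages: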

pip install google-genai playwright
playwright install chromium

Then initialize a Playwright browser instance to execute actions in:

from playwright.sync_api import sync_playwright

# 1. Configure screen dimensions for the target environment
SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900

# 2. Start the Playwright browser
# In production, utilize a sandboxed environment.
playwright = sync_playwright().start()
# Set headless=False to see the actions performed on your screen
browser = playwright.chromium.launch(headless=False)

# 3. Create a context and page with the specified dimensions
context = browser.new_context(
    viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT}
)
page = context.new_page()

# 4. Navigate to an initial page to start the task
page.goto("https://www.google.com")

# The 'page', 'SCREEN_WIDTH', and 'SCREEN_HEIGHT' variables
# will be used in the steps below.

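1. Send a request to the model

Initialize the client library and configure the Computer Use tool. You don't need to specify the display size in the request; the model returns coordinates normalized to the screen (0-999 in the action tables), which your client scales to the actual screen width and height.

Gemini 3.5 Flash (recommended)

Python

Use the google-genai Python SDK (version 2.7.0 or later) to configure a request that targets the browser environment: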

from google import genai
from google.genai.types import (
    Content,
    Part,
    GenerateContentConfig,
    Tool,
    ComputerUse,
    Environment,
    ThinkingConfig,
)

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=[
        Content(
            role="user",
            parts=[
                Part(text="Find a flight from SF to Hawaii on Jun 30th, coming back on Jul 6th"),
            ],
        )
    ],
    config=GenerateContentConfig(
        tools=[
            Tool(
                computer_use=ComputerUse(
                    environment=Environment.ENVIRONMENT_BROWSER,
                    enable_prompt_injection_detection=True,
                ),
            ),
        ],
        thinking_config=ThinkingConfig(
            include_thoughts=True
        ),
    )
)

print(response.text)

JavaScript

Use the @google/genai Node.js SDK to configure a request that targets the browser environment:

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

const response = await ai.models.generateContent({
  model: 'gemini-3.5-flash',
  contents: [
    {
      role: 'user',
      parts: [{ text: "Find a flight from SF to Hawaii on Jun 30th, coming back on Jul 6th" }]
    }
  ],
  config: {
    tools: [{
      computerUse: {
        environment: "ENVIRONMENT_BROWSER",
        enablePromptInjectionDetection: true
      }
    }],
    thinkingConfig: {
      includeThoughts: true
    }
  }
});

console.log(response.text);

REST

Send the request with curl:

curl -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.5-flash:generateContent?key=$GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {
            "text": "Find me a flight from SF to Hawaii on Jun 30th, coming back on Jul 6th. Start by navigating directly to flights.google.com"
          }
        ]
      }
    ],
    "tools": [
      {
        "computer_use": {
          "environment": "ENVIRONMENT_BROWSER",
          "enable_prompt_injection_detection": true
        }
      }
    ]
  }'

Gemini 2.5 (legacy)

Python

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

# Specify predefined functions to exclude (optional)
excluded_functions = ["drag_and_drop"]

generate_content_config = types.GenerateContentConfig(
    tools=[
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
                excluded_predefined_functions=excluded_functions,
            )
        ),
    ],
)

contents=[
    Content(
        role="user",
        parts=[
            Part(text="Search for highly rated smart fridges on Google Shopping."),
        ],
    )
]

response = client.models.generate_content(
    model='gemini-2.5-computer-use-preview-10-2025',
    contents=contents,
    config=generate_content_config,
)

print(response)

JavaScript

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

// Specify predefined functions to exclude (optional)
const excludedFunctions = ["drag_and_drop"];

const response = await ai.models.generateContent({
  model: 'gemini-2.5-computer-use-preview-10-2025',
  contents: [
    {
      role: 'user',
      parts: [{ text: "Search for highly rated smart fridges on Google Shopping." }]
    }
  ],
  config: {
    tools: [{
      computerUse: {
        environment: "ENVIRONMENT_BROWSER",
        excludedPredefinedFunctions: excludedFunctions
      }
    }]
  }
});

console.log(response);

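2. Receive the model response

The model responds with a suggested function call. For Gemini 3.5 Flash, the response includes a tailored reasoning intent along with the coordinates. The following examples show both kinds of response: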

Gemini 3.5 Flash

{
  "function_call": {
    "name": "click",
    "args": {
      "x": 450,
      "y": 120,
      "intent": "Click the search box to type the destination."
    }
  }
}

Gemini 2.5 (legacy)

{
  "content": {
    "parts": [
      {
        "text": "I will type the search query into the search bar."
      },
      {
        "function_call": {
          "name": "type_text_at",
          "args": {
            "x": 371,
            "y": 470,
            "text": "highly rated smart fridges",
            "press_enter": true
          }
        }
      }
    ]
  }
}
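
To make the coordinate handling concrete, here is a small sketch that reads a `function_call` shaped like the Gemini 3.5 Flash example above and converts its coordinates to pixels for a 1440x900 viewport. It assumes the 0-999 normalized range listed in the action tables and the divide-by-1000 scaling used by the helpers in step 3, written here with integer arithmetic; `to_pixels` is a hypothetical helper name.

```python
def to_pixels(function_call: dict, screen_width: int, screen_height: int) -> dict:
    """Return the action name, intent, and pixel coordinates for one call."""
    args = function_call.get("args", {})
    parsed = {
        "name": function_call["name"],
        # Gemini 3.5 Flash puts the intent in args; the legacy example has none
        "intent": args.get("intent"),
    }
    if "x" in args and "y" in args:
        # Assumed scaling: normalized value * screen dimension / 1000, rounded down
        parsed["x"] = args["x"] * screen_width // 1000
        parsed["y"] = args["y"] * screen_height // 1000
    return parsed


call = {
    "name": "click",
    "args": {"x": 450, "y": 120, "intent": "Click the search box to type the destination."},
}
print(to_pixels(call, 1440, 900))
```

Actions without coordinates, such as `navigate`, pass through with only the name and intent.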

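3. Execute the received action

Your application code needs to parse the model response, execute the action, and collect the result.

The following code handles both the legacy tool commands (click_at, type_text_at) and the streamlined Gemini 3.5 Flash commands (click, type).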

Python

from typing import Any, List, Tuple
import time

def denormalize_x(x: int, screen_width: int) -> int:
    """Convert a normalized x coordinate (0-999) to an actual pixel coordinate."""
    return int(x / 1000 * screen_width)

def denormalize_y(y: int, screen_height: int) -> int:
    """Convert a normalized y coordinate (0-999) to an actual pixel coordinate."""
    return int(y / 1000 * screen_height)

def execute_function_calls(candidate, page, screen_width, screen_height):
    results = []
    function_calls = []

    # Parse content parts (Handling legacy and Gemini 3 response structures)
    parts = candidate.content.parts if hasattr(candidate, 'content') else []
    if not parts and hasattr(candidate, 'function_calls'):
        function_calls = candidate.function_calls
    else:
        for part in parts:
            if part.function_call:
                function_calls.append(part.function_call)

    for function_call in function_calls:
        action_result = {}
        fname = function_call.name
        args = function_call.args
        print(f"  -> Executing: {fname} (Intent: {args.get('intent', 'N/A')})")

        try:
            if fname in ("open_web_browser", "open_app"):
                pass # Handled / already open
            elif fname in ("click", "click_at", "double_click", "triple_click", "middle_click", "right_click", "move", "long_press"):
                actual_x = denormalize_x(args["x"], screen_width)
                actual_y = denormalize_y(args["y"], screen_height)

                if fname in ("click", "click_at"):
                    page.mouse.click(actual_x, actual_y)
                elif fname == "double_click":
                    page.mouse.dblclick(actual_x, actual_y)
                elif fname == "right_click":
                    page.mouse.click(actual_x, actual_y, button="right")
                elif fname == "middle_click":
                    page.mouse.click(actual_x, actual_y, button="middle")
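                elif fname == "move":
                    page.mouse.move(actual_x, actual_y)
                elif fname == "triple_click":
                    page.mouse.click(actual_x, actual_y, click_count=3)
                elif fname == "long_press":
                    # Hold the button down for the requested duration (milliseconds)
                    page.mouse.click(actual_x, actual_y, delay=args.get("seconds", 2) * 1000)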
            elif fname in ("type", "type_text_at"):
                actual_x = denormalize_x(args["x"], screen_width) if "x" in args else None
                actual_y = denormalize_y(args["y"], screen_height) if "y" in args else None
                text = args["text"]
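                # Legacy type_text_at presses Enter by default; Gemini 3.5 Flash type does not
                press_enter = args.get("press_enter", fname == "type_text_at")

                if actual_x is not None and actual_y is not None:
                    page.mouse.click(actual_x, actual_y)
                # Clear field first (ControlOrMeta maps to Cmd on macOS and Ctrl elsewhere)
                page.keyboard.press("ControlOrMeta+A")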
                page.keyboard.press("Backspace")
                page.keyboard.type(text)
                if press_enter:
                    page.keyboard.press("Enter")
            elif fname == "navigate":
                page.goto(args["url"])
            elif fname == "go_back":
                page.go_back()
            elif fname == "go_forward":
                page.go_forward()
            elif fname == "wait":
                time.sleep(args.get("seconds", 1))
            else:
                print(f"Warning: Custom or unhandled function {fname}")

            page.wait_for_load_state(timeout=5000)
            time.sleep(1)

        except Exception as e:
            print(f"Error executing {fname}: {e}")
            action_result = {"error": str(e)}

        results.append((fname, function_call.id, action_result))

    return results

JavaScript

function denormalizeX(x, screenWidth) {
    // Convert normalized x coordinate (0-1000) to actual pixel coordinate.
    return Math.floor((x / 1000) * screenWidth);
}

function denormalizeY(y, screenHeight) {
    // Convert normalized y coordinate (0-1000) to actual pixel coordinate.
    return Math.floor((y / 1000) * screenHeight);
}

async function executeFunctionCalls(candidate, page, screenWidth, screenHeight) {
    const results = [];
    let functionCalls = [];

    // Parse function calls from candidate response
    const parts = candidate.content?.parts || [];
    if (parts.length === 0 && candidate.functionCalls) {
        functionCalls = candidate.functionCalls;
    } else {
        for (const part of parts) {
            if (part.functionCall) {
                functionCalls.push(part.functionCall);
            }
        }
    }

    for (const functionCall of functionCalls) {
        const actionResult = {};
        const fname = functionCall.name;
        const args = functionCall.args;
        console.log(`  -> Executing: ${fname} (Intent: ${args.intent || 'N/A'})`);

        try {
            if (fname === "open_web_browser" || fname === "open_app") {
                // Handled / already open
            } else if (["click", "click_at", "double_click", "triple_click", "middle_click", "right_click", "move", "long_press"].includes(fname)) {
                const actualX = denormalizeX(args.x, screenWidth);
                const actualY = denormalizeY(args.y, screenHeight);

                if (fname === "click" || fname === "click_at") {
                    await page.mouse.click(actualX, actualY);
                } else if (fname === "double_click") {
                    await page.mouse.dblclick(actualX, actualY);
                } else if (fname === "right_click") {
                    await page.mouse.click(actualX, actualY, { button: "right" });
                } else if (fname === "middle_click") {
                    await page.mouse.click(actualX, actualY, { button: "middle" });
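                } else if (fname === "move") {
                    await page.mouse.move(actualX, actualY);
                } else if (fname === "triple_click") {
                    await page.mouse.click(actualX, actualY, { clickCount: 3 });
                } else if (fname === "long_press") {
                    // Hold the button down for the requested duration (milliseconds)
                    await page.mouse.click(actualX, actualY, { delay: (args.seconds || 2) * 1000 });
                }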
            } else if (fname === "type" || fname === "type_text_at") {
                const actualX = args.x !== undefined ? denormalizeX(args.x, screenWidth) : null;
                const actualY = args.y !== undefined ? denormalizeY(args.y, screenHeight) : null;
                const text = args.text;
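                // Legacy type_text_at presses Enter by default; Gemini 3.5 Flash type does not
                const pressEnter = args.press_enter ?? (fname === "type_text_at");

                if (actualX !== null && actualY !== null) {
                    await page.mouse.click(actualX, actualY);
                }
                // Clear field first (ControlOrMeta maps to Cmd on macOS and Ctrl elsewhere)
                await page.keyboard.press("ControlOrMeta+A");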
                await page.keyboard.press("Backspace");
                await page.keyboard.type(text);
                if (pressEnter) {
                    await page.keyboard.press("Enter");
                }
            } else if (fname === "navigate") {
                await page.goto(args.url);
            } else if (fname === "go_back") {
                await page.goBack();
            } else if (fname === "go_forward") {
                await page.goForward();
            } else if (fname === "wait") {
                await new Promise(resolve => setTimeout(resolve, (args.seconds || 1) * 1000));
            } else {
                console.log(`Warning: Custom or unhandled function ${fname}`);
            }

            await page.waitForLoadState('load', { timeout: 5000 }).catch(() => {});
            await new Promise(resolve => setTimeout(resolve, 1000));
        } catch (e) {
            console.log(`Error executing ${fname}: ${e}`);
            actionResult.error = e.message;
        }

        results.push([fname, functionCall.id, actionResult]);
    }

    return results;
}

4. Capture the new environment state

Capture a representation of the screen and return it to the model.

Python

import base64
import json

def get_function_responses(page, results):
    screenshot_bytes = page.screenshot(type="png")
    current_url = page.url
    function_responses = []
    for name, call_id, result in results:
        function_responses.append({
            "type": "function_result",
            "name": name,
            "call_id": call_id,
            "result": [
                {
                    "type": "text",
                    "text": json.dumps({"url": current_url, **result})
                },
                {
                    "type": "image",
                    "data": base64.b64encode(screenshot_bytes).decode("utf-8"),
                    "mime_type": "image/png"
                }
            ]
        })
    return function_responses

JavaScript

async function getFunctionResponses(page, results) {
    const screenshotBuffer = await page.screenshot({ type: 'png' });
    const screenshotBase64 = screenshotBuffer.toString('base64');
    const currentUrl = page.url();
    const functionResponses = [];

    for (const [name, callId, result] of results) {
        functionResponses.push({
            type: "function_result",
            name: name,
            call_id: callId,
            result: [
                {
                    type: "text",
                    text: JSON.stringify({ url: currentUrl, ...result })
                },
                {
                    type: "image",
                    data: screenshotBase64,
                    mime_type: "image/png"
                }
            ]
        });
    }
    return functionResponses;
}

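Once you've defined how to capture and format the environment state, you can combine all of these steps into a continuously running loop.

Building the agent loop

To implement multi-step interactions, combine the four steps from the How to implement Computer Use section into a loop. The loop keeps requesting actions and feeding the results back to the model until the task is complete.

Make sure to manage the conversation history correctly by appending both the model response and the function responses at every step.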

Python

import base64
import json
import time
from typing import Any, List, Tuple
from playwright.sync_api import sync_playwright
from google import genai
from google.genai import types

client = genai.Client()

SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900

print("Initializing browser...")
playwright = sync_playwright().start()
browser = playwright.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT})
page = context.new_page()

# Paste helper functions execute_function_calls and get_function_responses here

try:
    page.goto("https://ai.google.dev/gemini-api/docs")

    config = types.GenerateContentConfig(
        tools=[types.Tool(computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER,
            enable_prompt_injection_detection=True
        ))],
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    )

    initial_screenshot = page.screenshot(type="png")
    USER_PROMPT = "Go to ai.google.dev/gemini-api/docs and search for pricing."
    print(f"Goal: {USER_PROMPT}")

    contents = [
        types.Content(role="user", parts=[
            types.Part(text=USER_PROMPT),
            types.Part.from_bytes(data=initial_screenshot, mime_type='image/png')
        ])
    ]

    # Agent Loop
    turn_limit = 5
    for i in range(turn_limit):
        print(f"\n--- Turn {i+1} ---")
        print("Thinking...")
        response = client.models.generate_content(
            model='gemini-3.5-flash',
            contents=contents,
            config=config,
        )

        candidate = response.candidates[0]
        contents.append(candidate.content)

        has_function_calls = any(part.function_call for part in candidate.content.parts)
        if not has_function_calls:
            text_response = " ".join(
                part.text for part in candidate.content.parts if part.text
            )
            print("Agent finished:", text_response)
            break

        print("Executing actions...")
        results = execute_function_calls(candidate, page, SCREEN_WIDTH, SCREEN_HEIGHT)

        print("Capturing state...")
        function_responses = get_function_responses(page, results)

        contents.append(
            types.Content(role="user", parts=[types.Part(function_response=fr) for fr in function_responses])
        )

finally:
    print("Closing browser...")
    browser.close()
    playwright.stop()

JavaScript

import { chromium } from 'playwright';
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

// Constants for screen dimensions
const SCREEN_WIDTH = 1440;
const SCREEN_HEIGHT = 900;

console.log("Initializing browser...");
const browser = await chromium.launch({ headless: false });
const context = await browser.newContext({
    viewport: { width: SCREEN_WIDTH, height: SCREEN_HEIGHT }
});
const page = await context.newPage();

// Define helper functions. Copy/paste from steps 3 and 4:
// function denormalizeX(...)
// function denormalizeY(...)
// async function executeFunctionCalls(...)
// async function getFunctionResponses(...)

try {
    await page.goto("https://ai.google.dev/gemini-api/docs");

    const config = {
        tools: [{
            computerUse: {
                environment: "ENVIRONMENT_BROWSER",
                enablePromptInjectionDetection: true
            }
        }],
        thinkingConfig: { includeThoughts: true }
    };

    const initialScreenshotBuffer = await page.screenshot({ type: 'png' });
    const initialScreenshotBase64 = initialScreenshotBuffer.toString('base64');
    const USER_PROMPT = "Go to ai.google.dev/gemini-api/docs and search for pricing.";
    console.log(`Goal: ${USER_PROMPT}`);

    const contents = [
        {
            role: "user",
            parts: [
                { text: USER_PROMPT },
                {
                    inlineData: {
                        data: initialScreenshotBase64,
                        mimeType: "image/png"
                    }
                }
            ]
        }
    ];

    // Agent Loop
    const turnLimit = 5;
    for (let i = 0; i < turnLimit; i++) {
        console.log(`\n--- Turn ${i + 1} ---`);
        console.log("Thinking...");
        const response = await ai.models.generateContent({
            model: 'gemini-3.5-flash',
            contents: contents,
            config: config
        });

        const candidate = response.candidates[0];
        contents.push(candidate.content);

        const hasFunctionCalls = candidate.content.parts.some(part => part.functionCall);
        if (!hasFunctionCalls) {
            const textResponse = candidate.content.parts
                .filter(part => part.text)
                .map(part => part.text)
                .join(" ");
            console.log("Agent finished:", textResponse);
            break;
        }

        console.log("Executing actions...");
        const results = await executeFunctionCalls(candidate, page, SCREEN_WIDTH, SCREEN_HEIGHT);

        console.log("Capturing state...");
        const functionResponses = await getFunctionResponses(page, results);

        contents.push({
            role: "user",
            parts: functionResponses.map(fr => ({
                ...fr
            }))
        });
    }
} finally {
    console.log("Closing browser...");
    await browser.close();
}

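Supported environments (Gemini 3.5 Flash)

Gemini 3.5 Flash supports three environments, selected in the computer_use configuration:

Browser environment (`ENVIRONMENT_BROWSER`)

Actions available in the browser environment:

Command name	Description	Arguments (in the function call)
click	Left-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
double_click	Double-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
triple_click	Triple-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
middle_click	Middle-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
right_click	Right-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
mouse_down	Presses and holds the mouse button at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
mouse_up	Releases the mouse button at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
move	Moves the cursor to the given position.	`y`: int (0-999), `x`: int (0-999), `intent`: str
type	Types text.	`text`: str, `press_enter`: bool (optional, default `false`), `intent`: str
drag_and_drop	Drags an item from the start coordinates to the end coordinates.	`start_y`: int (0-999), `start_x`: int (0-999), `end_y`: int (0-999), `end_x`: int (0-999), `intent`: str
wait	Pauses execution for the given number of seconds.	`seconds`: int (optional, default `1`), `intent`: str
press_key	Presses and releases the given key.	`key`: str, `intent`: str
key_down	Presses and holds the given key.	`key`: str, `intent`: str
key_up	Releases the given key.	`key`: str, `intent`: str
hotkey	Presses the given key combination.	`keys`: `List[str]`, `intent`: `str`
take_screenshot	Returns a screenshot of the current screen.	`intent`: str
scroll	Scrolls up, down, left, or right at a coordinate by a pixel distance.	`y`: int (0-999), `x`: int (0-999), `direction`: str (`"up"`, `"down"`, `"left"`, `"right"`), `magnitude_in_pixels`: int (0-999, optional, default `300`), `intent`: str
go_back	Goes back to the previous page in the browser history.	`intent`: str
navigate	Goes directly to the given URL.	`url`: str, `intent`: str
go_forward	Goes forward to the next page in the browser history.	`intent`: str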
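
The `scroll` action isn't covered by the handler in step 3. One way to support it with Playwright is to turn `direction` and `magnitude_in_pixels` into wheel deltas. The sketch below assumes the argument names and the default of 300 from the table above; `scroll_deltas` is a hypothetical helper, not part of any SDK.

```python
from typing import Tuple


def scroll_deltas(direction: str, magnitude_in_pixels: int = 300) -> Tuple[int, int]:
    """Map a scroll direction and distance to (delta_x, delta_y) wheel deltas."""
    deltas = {
        "up": (0, -magnitude_in_pixels),
        "down": (0, magnitude_in_pixels),
        "left": (-magnitude_in_pixels, 0),
        "right": (magnitude_in_pixels, 0),
    }
    if direction not in deltas:
        raise ValueError(f"Unsupported scroll direction: {direction!r}")
    return deltas[direction]
```

Inside the action handler, you could move the pointer to the denormalized `x`/`y` with `page.mouse.move(...)` and then pass the two deltas to Playwright's `page.mouse.wheel(delta_x, delta_y)`.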

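Mobile environment (`ENVIRONMENT_MOBILE`)

Actions optimized for Android environments:

Command name	Description	Arguments (in the function call)
open_app	Opens an app by name.	`app_name`: str, `intent`: str
click	Left-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
list_apps	Lists the apps available on the device and returns their names and package names.	`intent`: str
wait	Pauses execution for the given number of seconds.	`seconds`: int (optional, default `1`), `intent`: str
go_back	Goes back to the previous screen or page.	`intent`: str
type	Types text.	`text`: str, `press_enter`: bool (optional, default `false`), `intent`: str
drag_and_drop	Drags an item from the start coordinates to the end coordinates.	`start_y`: int (0-999), `start_x`: int (0-999), `end_y`: int (0-999), `end_x`: int (0-999), `intent`: str
long_press	Performs a long press at a coordinate on the screen.	`y`: int (0-999), `x`: int (0-999), `seconds`: int (optional, default `2`), `intent`: str
press_key	Presses and releases the given key.	`key`: str, `intent`: str
take_screenshot	Returns a screenshot of the current screen.	`intent`: str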

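Desktop environment (`ENVIRONMENT_DESKTOP`)

OS-level cursor commands for the desktop environment:

Command name	Description	Arguments (in the function call)
click	Left-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
double_click	Double-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
triple_click	Triple-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
middle_click	Middle-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
right_click	Right-clicks at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
mouse_down	Presses and holds the mouse button at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
mouse_up	Releases the mouse button at the given coordinates.	`y`: int (0-999), `x`: int (0-999), `intent`: str
move	Moves the cursor to the given position.	`y`: int (0-999), `x`: int (0-999), `intent`: str
type	Types text.	`text`: str, `press_enter`: bool (optional, default `false`), `intent`: str
drag_and_drop	Drags an item from the start coordinates to the end coordinates.	`start_y`: int (0-999), `start_x`: int (0-999), `end_y`: int (0-999), `end_x`: int (0-999), `intent`: str
wait	Pauses execution for the given number of seconds.	`seconds`: int (optional, default `1`), `intent`: str
press_key	Presses and releases the given key.	`key`: str, `intent`: str
key_down	Presses and holds the given key.	`key`: str, `intent`: str
key_up	Releases the given key.	`key`: str, `intent`: str
hotkey	Presses the given key combination.	`keys`: `List[str]`, `intent`: `str`
take_screenshot	Returns a screenshot of the current screen.	`intent`: str
scroll	Scrolls up, down, left, or right at a coordinate by a pixel distance.	`y`: int (0-999), `x`: int (0-999), `direction`: str (`"up"`, `"down"`, `"left"`, `"right"`), `magnitude_in_pixels`: int (0-999, optional, default `300`), `intent`: str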

Supported UI actions for the legacy model (Gemini 2.5)

The legacy model (gemini-2.5-computer-use-preview-10-2025) supports the following actions:

Command name	Description	Arguments (in the function call)	Example function call
open_web_browser	Opens the web browser.	None	`{"name": "open_web_browser", "args": {}}`
wait_5_seconds	Pauses execution for 5 seconds.	None	`{"name": "wait_5_seconds", "args": {}}`
go_back	Goes to the previous page in the history.	None	`{"name": "go_back", "args": {}}`
go_forward	Goes to the next page in the history.	None	`{"name": "go_forward", "args": {}}`
search	Navigates to the default search engine.	None	`{"name": "search", "args": {}}`
navigate	Navigates the browser directly to the given URL.	`url`: str	`{"name": "navigate", "args": {"url": "https://www.wikipedia.org"}}`
click_at	Clicks at a specific coordinate.	`y`: int (0-999), `x`: int (0-999)	`{"name": "click_at", "args": {"y": 300, "x": 500}}`
hover_at	Hovers the mouse at a specific coordinate.	`y`: int (0-999), `x`: int (0-999)	`{"name": "hover_at", "args": {"y": 150, "x": 250}}`
type_text_at	Types text at a coordinate.	`y`: int (0-999), `x`: int (0-999), `text`: str, `press_enter`: bool (optional, default True), `clear_before_typing`: bool (optional, default True)	`{"name": "type_text_at", "args": {"y": 250, "x": 400, "text": "search", "press_enter": false}}`
key_combination	Presses a key or key combination.	`keys`: str	`{"name": "key_combination", "args": {"keys": "Control+A"}}`
scroll_document	Scrolls the entire page.	`direction`: str	`{"name": "scroll_document", "args": {"direction": "down"}}`
scroll_at	Scrolls at the coordinate (x, y).	`y`: int, `x`: int, `direction`: str, `magnitude`: int (optional, default 800)	`{"name": "scroll_at", "args": {"y": 500, "x": 500, "direction": "down"}}`
drag_and_drop	Drags between two coordinates.	`y`: int, `x`: int, `destination_y`: int, `destination_x`: int	`{"name": "drag_and_drop", "args": {"y": 100, "destination_y": 500, "destination_x": 500, "x": 100}}`
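
`drag_and_drop` uses different argument names in the two generations: `start_x`/`start_y`/`end_x`/`end_y` for Gemini 3.5 Flash, and `x`/`y`/`destination_x`/`destination_y` for the legacy model. If a single handler has to serve both, a small adapter can normalize the arguments first. This is a sketch based only on the tables above; `normalize_drag_args` is a hypothetical name.

```python
def normalize_drag_args(args: dict) -> dict:
    """Return drag coordinates under the Gemini 3.5 Flash argument names."""
    if "start_x" in args:
        # Already in the Gemini 3.5 Flash shape
        keys = ("start_x", "start_y", "end_x", "end_y")
        return {key: args[key] for key in keys}
    # Legacy shape: x/y is the start point, destination_x/destination_y the end
    return {
        "start_x": args["x"],
        "start_y": args["y"],
        "end_x": args["destination_x"],
        "end_y": args["destination_y"],
    }
```

The returned values are still normalized coordinates, so they need the same denormalization as any other action before they reach the automation tool.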

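Custom user-defined functions

You can extend the model's capabilities by adding custom user-defined functions. For example, in a human-in-the-loop (HITL) scenario, you can exclude default predefined actions and register custom ones.

Gemini 3.5 Flash custom tools

Python

Exclude a standard predefined action (such as click) and register a custom yield_to_user tool: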

from google import genai
from google.genai import types

client = genai.Client()

yield_to_user_tool = types.FunctionDeclaration(
    name="yield_to_user",
    description="Yields control back to the user for assistance or verification when an automated action is unsafe or ambiguous.",
    parameters=types.Schema(
        type="OBJECT",
        properties={
            "reason": types.Schema(
                type="STRING",
                description="The reason why the agent is yielding control to the human."
            )
        },
        required=["reason"]
    )
)

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Click the submit button. If you need a second factor authentication code, ask me.",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment="ENVIRONMENT_MOBILE",
                    excluded_predefined_functions=["click"]
                )
            ),
            types.Tool(function_declarations=[yield_to_user_tool])
        ]
    )
)

Gemini 2.5 (legacy) custom tools

Python

from typing import Optional, Dict, Any
from google import genai
from google.genai import types

client = genai.Client()

# Define custom tools here
custom_functions = [...] # Describe parameters as FunctionDeclaration object

def make_generate_content_config():
    excluded_functions = ["open_web_browser", "wait_5_seconds", "go_back", "go_forward", "search", "navigate", "hover_at", "scroll_document", "key_combination", "drag_and_drop"]
    generate_content_config = types.GenerateContentConfig(
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                    excluded_predefined_functions=excluded_functions
                )
            ),
            types.Tool(function_declarations=custom_functions)
        ]
    )
    return generate_content_config

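Managing thinking levels (Gemini 3.5 Flash)

For Computer Use agents, you can configure different thinking levels to balance action quality against execution speed. A lower thinking level usually strikes a good balance for standard automation tasks.

Safety

Configuring safety policies (Gemini 3.5 Flash)

Gemini 3.5 Flash includes built-in safety categories that automatically determine whether user confirmation is required.

Safety policy category	Description
`FINANCIAL_TRANSACTIONS`	Blocks, or triggers confirmation for, transactions involving payments, retail checkout, or regulated goods.
`SENSITIVE_DATA_MODIFICATION`	Protects health, financial, or government records from unauthorized modification.
`COMMUNICATION_TOOL`	Restricts the agent from autonomously sending emails, chat messages, or drafts.
`ACCOUNT_CREATION`	Restricts the agent from autonomously registering new accounts on websites.
`DATA_MODIFICATION`	Governs general file system modification, data sharing, and storage deletion.
`USER_CONSENT_MANAGEMENT`	Requires the user to take over cookie consent banners and privacy prompts.
`LEGAL_TERMS_AND_AGREEMENTS`	Prevents the model from autonomously accepting terms of service or legally binding contracts.

Safety overrides

You can override selected policies by passing overrides: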

Python

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Clean up the local folder by archiving old logs.",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_DESKTOP,
                    safety_policy_overrides=[
                        types.SafetyPolicyOverride(category="DATA_MODIFICATION")
                    ]
                )
            )
        ]
    )
)

JavaScript

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

const response = await ai.models.generateContent({
  model: 'gemini-3.5-flash',
  contents: "Clean up the local folder by archiving old logs.",
  config: {
    tools: [{
      computerUse: {
        environment: "ENVIRONMENT_DESKTOP",
        safetyPolicyOverrides: [
          { category: "DATA_MODIFICATION" }
        ]
      }
    }]
  }
});

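Prompt injection detection (Gemini 3.5 Flash)

An opt-in safety mechanism that scans screenshot pixels for hidden adversarial prompt instructions (for example, "ignore previous commands") and blocks execution when it detects them.

Acknowledging safety decisions (Gemini 2.5 legacy)

For legacy models, the response may include a safety_decision parameter: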

{
  "function_call": {
    "name": "click_at",
    "args": {
      "x": 60,
      "y": 100,
      "safety_decision": {
        "explanation": "Must check check-box",
        "decision": "require_confirmation"
      }
    }
  }
}

If safety_decision is require_confirmation, prompt the end user. If the user confirms, set safety_acknowledgement in the FunctionResponse.

Python

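def get_safety_confirmation(safety_decision):
    # Prompt the end user here (placeholder: always continue)
    return "CONTINUE" # Or TERMINATE

# Inside the loop over function_calls in execute_function_calls:
extra_fr_fields = {}
if 'safety_decision' in function_call.args:
    decision = get_safety_confirmation(function_call.args['safety_decision'])
    if decision == "TERMINATE":
        break
    # Attach this field to the FunctionResponse sent back for this call
    extra_fr_fields["safety_acknowledgement"] = "true"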
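
The fragment above can also be factored into a function that is easy to test without a browser. The sketch below takes the call's arguments and a confirmation callback, and returns whether to proceed plus any extra fields for the function response. `resolve_safety_decision` and the `confirm` callback are hypothetical; only `safety_decision`, its `explanation`, and `safety_acknowledgement` come from the text above.

```python
from typing import Callable, Tuple


def resolve_safety_decision(
    args: dict, confirm: Callable[[str], bool]
) -> Tuple[bool, dict]:
    """Decide whether an action may run and which fields to send back.

    confirm receives the model's explanation and returns True if the end
    user approves the action.
    """
    safety_decision = args.get("safety_decision")
    if safety_decision is None:
        # No safety decision attached: nothing to acknowledge
        return True, {}
    # Mirror the fragment above: any attached safety decision goes to the user
    if confirm(safety_decision.get("explanation", "")):
        return True, {"safety_acknowledgement": "true"}
    return False, {}
```

In an interactive agent, `confirm` could wrap `input()`; in tests it can be a lambda that returns a fixed answer.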

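Safety best practices

Computer Use carries distinct security and operational risks, because a model acting on the user's behalf may encounter untrusted on-screen content or make mistakes while executing actions. Apply the following best practices to protect user data and systems:

Human-in-the-loop (HITL):

- Require user confirmation: when the safety response indicates require_confirmation (or a legacy safety decision calls for it), prompt the user for approval.
- Provide custom safety instructions: implement custom system instructions that define and enforce your own safety boundaries. For example: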

Python

from google import genai
from google.genai import types

system_instruction = """
## **RULE 1: Seek User Confirmation (USER_CONFIRMATION)**

This is your first and most important check. If the next required action falls
into any of the following categories, you MUST stop immediately, and seek the
user's explicit permission.

**Procedure for Seeking Confirmation:**
* **For Consequential Actions:** Perform all preparatory steps (e.g., navigating,
  filling out forms, typing a message). You will ask for confirmation **AFTER**
  all necessary information is entered on the screen, but **BEFORE** you perform
  the final, irreversible action (e.g., before clicking "Send", "Submit",
  "Confirm Purchase", "Share").
* **For Prohibited Actions:** If the action is strictly forbidden (e.g., accepting
  legal terms, solving a CAPTCHA), you must first inform the user about the
  required action and ask for their confirmation to proceed.

**USER_CONFIRMATION Categories:**

*   **Consent and Agreements:** You are FORBIDDEN from accepting, selecting, or
    agreeing to any of the following on the user's behalf. You must ask the
    user to confirm before performing these actions.
    *   Terms of Service
    *   Privacy Policies
    *   Cookie consent banners
    *   End User License Agreements (EULAs)
    *   Any other legally significant contracts or agreements.
*   **Robot Detection:** You MUST NEVER attempt to solve or bypass the
    following. You must ask the user to confirm before performing these actions.
    *   CAPTCHAs (of any kind)
    *   Any other anti-robot or human-verification mechanisms, even if you are
        capable.
*   **Financial Transactions:**
    *   Completing any purchase.
    *   Managing or moving money (e.g., transfers, payments).
    *   Purchasing regulated goods or participating in gambling.
*   **Sending Communications:**
    *   Sending emails.
    *   Sending messages on any platform (e.g., social media, chat apps).
    *   Posting content on social media or forums.
*   **Accessing or Modifying Sensitive Information:**
    *   Health, financial, or government records (e.g., medical history, tax
        forms, passport status).
    *   Revealing or modifying sensitive personal identifiers (e.g., SSN, bank
        account number, credit card number).
*   **User Data Management:**
    *   Accessing, downloading, or saving files from the web.
    *   Sharing or sending files/data to any third party.
    *   Transferring user data between systems.
*   **Browser Data Usage:**
    *   Accessing or managing Chrome browsing history, bookmarks, autofill data,
        or saved passwords.
*   **Security and Identity:**
    *   Logging into any user account.
    *   Any action that involves misrepresentation or impersonation (e.g.,
        creating a fan account, posting as someone else).
*   **Insurmountable Obstacles:** If you are technically unable to interact with
    a user interface element or are stuck in a loop you cannot resolve, ask the
    user to take over.
---

## **RULE 2: Default Behavior (ACTUATE)**

If an action does **NOT** fall under the conditions for `USER_CONFIRMATION`,
your default behavior is to **Actuate**.

**Actuation Means:**  You MUST proactively perform all necessary steps to move
the user's request forward. Continue to actuate until you either complete the
non-consequential task or encounter a condition defined in Rule 1.

*   **Example 1:** If asked to send money, you will navigate to the payment
    portal, enter the recipient's details, and enter the amount. You will then
    **STOP** as per Rule 1 and ask for confirmation before clicking the final
    "Send" button.
*   **Example 2:** If asked to post a message, you will navigate to the site,
    open the post composition window, and write the full message. You will then
    **STOP** as per Rule 1 and ask for confirmation before clicking the final
    "Post" button.

    After the user has confirmed, remember to get the user's latest screen
    before continuing to perform actions.

# Final Response Guidelines:
Write final response to the user in the following cases:
- User confirmation
- When the task is complete or you have enough information to respond to the user
"""

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Prepare a draft but do not send.",
    config=types.GenerateContentConfig(
        system_instruction=system_instruction,
        tools=[types.Tool(computer_use=types.ComputerUse(environment="ENVIRONMENT_BROWSER"))]
    )
)

JavaScript

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

const systemInstruction = `
## **RULE 1: Seek User Confirmation (USER_CONFIRMATION)**

This is your first and most important check. If the next required action falls
into any of the following categories, you MUST stop immediately, and seek the
user's explicit permission.

**Procedure for Seeking Confirmation:**
* **For Consequential Actions:** Perform all preparatory steps (e.g., navigating,
  filling out forms, typing a message). You will ask for confirmation **AFTER**
  all necessary information is entered on the screen, but **BEFORE** you perform
  the final, irreversible action (e.g., before clicking "Send", "Submit",
  "Confirm Purchase", "Share").
* **For Prohibited Actions:** If the action is strictly forbidden (e.g., accepting
  legal terms, solving a CAPTCHA), you must first inform the user about the
  required action and ask for their confirmation to proceed.

**USER_CONFIRMATION Categories:**

*   **Consent and Agreements:** You are FORBIDDEN from accepting, selecting, or
    agreeing to any of the following on the user's behalf. You must ask the
    user to confirm before performing these actions.
    *   Terms of Service
    *   Privacy Policies
    *   Cookie consent banners
    *   End User License Agreements (EULAs)
    *   Any other legally significant contracts or agreements.
*   **Robot Detection:** You MUST NEVER attempt to solve or bypass the
    following. You must ask the user to confirm before performing these actions.
    *   CAPTCHAs (of any kind)
    *   Any other anti-robot or human-verification mechanisms, even if you are
        capable.
*   **Financial Transactions:**
    *   Completing any purchase.
    *   Managing or moving money (e.g., transfers, payments).
    *   Purchasing regulated goods or participating in gambling.
*   **Sending Communications:**
    *   Sending emails.
    *   Sending messages on any platform (e.g., social media, chat apps).
    *   Posting content on social media or forums.
*   **Accessing or Modifying Sensitive Information:**
    *   Health, financial, or government records (e.g., medical history, tax
        forms, passport status).
    *   Revealing or modifying sensitive personal identifiers (e.g., SSN, bank
        account number, credit card number).
*   **User Data Management:**
    *   Accessing, downloading, or saving files from the web.
    *   Sharing or sending files/data to any third party.
    *   Transferring user data between systems.
*   **Browser Data Usage:**
    *   Accessing or managing Chrome browsing history, bookmarks, autofill data,
        or saved passwords.
*   **Security and Identity:**
    *   Logging into any user account.
    *   Any action that involves misrepresentation or impersonation (e.g.,
        creating a fan account, posting as someone else).
*   **Insurmountable Obstacles:** If you are technically unable to interact with
    a user interface element or are stuck in a loop you cannot resolve, ask the
    user to take over.
---

## **RULE 2: Default Behavior (ACTUATE)**

If an action does **NOT** fall under the conditions for `USER_CONFIRMATION`,
your default behavior is to **Actuate**.

**Actuation Means:**  You MUST proactively perform all necessary steps to move
the user's request forward. Continue to actuate until you either complete the
non-consequential task or encounter a condition defined in Rule 1.

*   **Example 1:** If asked to send money, you will navigate to the payment
    portal, enter the recipient's details, and enter the amount. You will then
    **STOP** as per Rule 1 and ask for confirmation before clicking the final
    "Send" button.
*   **Example 2:** If asked to post a message, you will navigate to the site,
    open the post composition window, and write the full message. You will then
    **STOP** as per Rule 1 and ask for confirmation before clicking the final
    "Post" button.

    After the user has confirmed, remember to get the user's latest screen
    before continuing to perform actions.

# Final Response Guidelines:
Write final response to the user in the following cases:
- User confirmation
- When the task is complete or you have enough information to respond to the user
`;

const response = await ai.models.generateContent({
  model: 'gemini-3.5-flash',
  contents: "Prepare a draft but do not send.",
  config: {
    systemInstruction: systemInstruction,
    tools: [{
      computerUse: {
        environment: "ENVIRONMENT_BROWSER"
      }
    }]
  }
});

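Secure execution environment: Run the agent in a secure, sandboxed environment to limit its potential impact. This can be a sandboxed virtual machine (VM), a container (for example, Docker), or a dedicated browser profile with limited permissions. For a guide to setting up a sandbox with Docker, see the GitHub reference implementation.
Input sanitization: Sanitize all user-generated text in prompts to reduce the risk of unintended instructions or prompt injection. This is a useful layer of security, but it does not replace a secure execution environment.
Content guardrails: Use guardrails and content safety APIs to evaluate whether user inputs, tool inputs and outputs, and the agent's responses are appropriate, and to detect prompt injection and jailbreak attempts.
Allowlists and blocklists: Implement filtering to control which websites the model can visit and which actions it can perform. A blocklist of prohibited websites is a good starting point; a more restrictive allowlist is more secure.
Observability and logging: Maintain detailed logs for debugging, auditing, and incident response. Your client should log prompts, screenshots, model-suggested actions (function_call), safety responses, and every action the client ultimately executes.
Environment management: Keep the GUI environment consistent. Unexpected pop-ups, notifications, or layout changes can confuse the model. Where possible, start each new task from a known clean state.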
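
The allowlist and blocklist practice above can be enforced on the client before any navigation action runs. The following is a minimal sketch using only the Python standard library. The host sets and the `is_navigation_allowed` helper are hypothetical, and a real deployment would likely load the lists from configuration.

```python
from typing import Iterable
from urllib.parse import urlsplit

# Hypothetical lists for illustration only.
ALLOWED_HOSTS = {"example.com", "docs.example.com"}
BLOCKED_HOSTS = {"ads.example.com"}


def _matches(host: str, domains: Iterable[str]) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def is_navigation_allowed(url: str) -> bool:
    """Check a URL against the blocklist first, then the allowlist."""
    parts = urlsplit(url)
    # Only plain web URLs are considered; file:, javascript:, etc. are refused.
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if not host or _matches(host, BLOCKED_HOSTS):
        return False
    return _matches(host, ALLOWED_HOSTS)
```

In this sketch the blocklist takes priority, so a blocked subdomain is refused even when its parent domain is allowed. The client would call the check just before executing a model-suggested navigation and skip or report the action when it returns False.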

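Model versions

You can use the Computer Use tool with the following models:

Gemini 3.5 Flash (gemini-3.5-flash): The recommended model for Computer Use, with streamlined intent-based actions, support for browser, mobile, and desktop environments, configurable safety policies, and prompt injection detection.
Gemini 3 Flash Preview (gemini-3-flash-preview): A preview model that supports Computer Use.
Gemini 2.5 (legacy preview) (gemini-2.5-computer-use-preview-10-2025): A legacy preview model optimized for browser-based Computer Use.

Next steps

Try Computer Use in the Browserbase demo environment.
For sample code, see the reference implementation.
Learn about other Gemini API tools:
- Function calling
- Grounding with Google Search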
