Computer Use

The Computer Use tool lets you build browser, mobile, and desktop control agents that interact with and automate tasks. Using screenshots, the model can "see" a computer screen, and "act" by generating specific UI actions like mouse clicks and keyboard inputs. Similar to function calling, you will need to implement the client-side execution environment to receive and execute the Computer Use actions.

Gemini 3.5 Flash is the recommended model for Computer Use, and introduces several new capabilities:

  • Multi-environment support: build agents for browser, mobile, and desktop environments.
  • Streamlined actions with intents: actions include an intent field that explains the model's reasoning behind each step.
  • Configurable safety policies: fine-tune safety behavior with built-in policy categories and overrides.
  • Prompt injection detection: opt-in screenshot scanning to detect hidden adversarial instructions.

With Computer Use, you can build agents that:

  • Automate repetitive data entry or form filling on websites.
  • Perform automated testing of web applications and user flows
  • Conduct research across various websites (e.g., gathering product information, prices, and reviews from ecommerce sites to inform a purchase)

Here's a minimal example of enabling the Computer Use tool:

Python

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Search for 'Gemini API' on Google.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
            )
        )]
    )
)

print(response.text)

JavaScript

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

const response = await ai.models.generateContent({
  model: 'gemini-3.5-flash',
  contents: "Search for 'Gemini API' on Google.",
  config: {
    tools: [{
      computerUse: {
        environment: "ENVIRONMENT_BROWSER",
      }
    }]
  }
});

console.log(response.text);

How Computer Use works

To build an agent with the Computer Use model, you need to set up a continuous loop between your application and the API. Here is what your code will do at each step:

  1. Send a request to the model
    • Your application sends an API request containing the Computer Use tool, your configuration settings (like the target environment), the user's prompt, and a screenshot of the current screen.
  2. Receive the model response
    • The model analyzes the screen and the prompt, returning a response which includes a suggested function_call representing a UI action (such as a click, scroll, or keystroke).
    • For Gemini 3.5 Flash, the response also includes a reasoning intent explaining why the model chose that action.
    • For legacy models (such as gemini-2.5-computer-use-preview-10-2025), the response may include a safety_decision from an internal safety system that classifies the action as regular/allowed, require_confirmation (requiring user approval), or blocked.
  3. Execute the received action
    • If the action is allowed (or the user confirms it), your client-side code parses the function_call, scales the normalized coordinates to match your viewport, and executes the action in your target environment using automation tools (such as Playwright). If the action is blocked, your client should halt the execution or handle the interruption.
  4. Capture the new environment state
    • After the action finishes executing, your application captures a new screenshot and sends it back to the model in a function_result to request the next step.

This process then repeats from step 2, continually soliciting the next action from the model until the task is completed or terminated.

Computer Use overview

How to implement Computer Use

Before building with the Computer Use tool you will need to set up:

  • Secure execution environment: Run your agent in a sandboxed VM or container to isolate it from your host system and limit its potential impact. The reference implementation includes a ready-to-use Docker-based sandbox you can use as a starting point.
  • Client-side action handler: Implement client-side logic to execute coordinates, type text, and take screenshots.

The examples below use a web browser as the execution environment and Playwright as the client-side handler.

0. Set up Playwright

First, install the required packages:

pip install google-genai playwright
playwright install chromium

Then, initialize a Playwright browser instance to use for execution:

from playwright.sync_api import sync_playwright

# 1. Configure screen dimensions for the target environment
SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900

# 2. Start the Playwright browser
# In production, utilize a sandboxed environment.
playwright = sync_playwright().start()
# Set headless=False to see the actions performed on your screen
browser = playwright.chromium.launch(headless=False)

# 3. Create a context and page with the specified dimensions
context = browser.new_context(
    viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT}
)
page = context.new_page()

# 4. Navigate to an initial page to start the task
page.goto("https://www.google.com")

# The 'page', 'SCREEN_WIDTH', and 'SCREEN_HEIGHT' variables
# will be used in the steps below.

1. Send a request to the model

Initialize the client library and configure the Computer Use tool. Note that there is no need to specify the display size when issuing a request; the model predicts pixel coordinates scaled to the height and width of the screen.

Python

Use the google-genai Python SDK (version 2.7.0 or higher) to configure a request targeting the browser environment:

from google import genai
from google.genai.types import (
    Content,
    Part,
    GenerateContentConfig,
    Tool,
    ComputerUse,
    Environment,
    ThinkingConfig,
)

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=[
        Content(
            role="user",
            parts=[
                Part(text="Find a flight from SF to Hawaii on Jun 30th, coming back on Jul 6th"),
            ],
        )
    ],
    config=GenerateContentConfig(
        tools=[
            Tool(
                computer_use=ComputerUse(
                    environment=Environment.ENVIRONMENT_BROWSER,
                    enable_prompt_injection_detection=True,
                ),
            ),
        ],
        thinking_config=ThinkingConfig(
            include_thoughts=True
        ),
    )
)

print(response.text)

JavaScript

Use the @google/genai Node.js SDK to configure a request targeting the browser environment:

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

const response = await ai.models.generateContent({
  model: 'gemini-3.5-flash',
  contents: [
    {
      role: 'user',
      parts: [{ text: "Find a flight from SF to Hawaii on Jun 30th, coming back on Jul 6th" }]
    }
  ],
  config: {
    tools: [{
      computerUse: {
        environment: "ENVIRONMENT_BROWSER",
        enable_prompt_injection_detection: true
      }
    }],
    thinkingConfig: {
      includeThoughts: true
    }
  }
});

console.log(response.text);

REST

Use curl to send a request:

curl -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.5-flash:generateContent?key=$GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": {
          "text": "Find me a flight from SF to Hawaii on Jun 30th, coming back on Jul 6th. Start by navigating directly to flights.google.com"
        }
      }
    ],
    "tools": [
      {
        "computer_use": {
          "environment": "ENVIRONMENT_BROWSER",
          "enable_prompt_injection_detection": true
        }
      }
    ]
  }'

Gemini 2.5 (Legacy)

Python

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

# Specify predefined functions to exclude (optional)
excluded_functions = ["drag_and_drop"]

generate_content_config = genai.types.GenerateContentConfig(
    tools=[
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
                excluded_predefined_functions=excluded_functions
                )
              ),
          ],
  )

contents=[
    Content(
        role="user",
        parts=[
            Part(text="Search for highly rated smart fridges on Google Shopping."),
        ],
    )
]

response = client.models.generate_content(
    model='gemini-2.5-computer-use-preview-10-2025',
    contents=contents,
    config=generate_content_config,
)

print(response)

JavaScript

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

// Specify predefined functions to exclude (optional)
const excludedFunctions = ["drag_and_drop"];

const response = await ai.models.generateContent({
  model: 'gemini-2.5-computer-use-preview-10-2025',
  contents: [
    {
      role: 'user',
      parts: [{ text: "Search for highly rated smart fridges on Google Shopping." }]
    }
  ],
  config: {
    tools: [{
      computerUse: {
        environment: "ENVIRONMENT_BROWSER",
        excluded_predefined_functions: excludedFunctions
      }
    }]
  }
});

console.log(response);

2. Receive the model response

The response model suggests a function call. For Gemini 3.5 Flash, the response contains a tailored reasoning intent alongside coordinates. The following shows examples of both responses:

Gemini 3.5 Flash

{
  "function_call": {
    "name": "click",
    "args": {
      "x": 450,
      "y": 120,
      "intent": "Click the search box to type the destination."
    }
  }
}

Gemini 2.5 (Legacy)

{
  "content": {
    "parts": [
      {
        "text": "I will type the search query into the search bar."
      },
      {
        "function_call": {
          "name": "type_text_at",
          "args": {
            "x": 371,
            "y": 470,
            "text": "highly rated smart fridges",
            "press_enter": true
          }
        }
      }
    ]
  }
}

3. Execute the received actions

Your application code needs to parse the model response, execute the actions, and collect the results.

The code below handles both legacy tool commands (click_at, type_text_at) and Gemini 3.5 Flash streamlined commands (click, type).

Python

from typing import Any, List, Tuple
import time

def denormalize_x(x: int, screen_width: int) -> int:
    """Convert normalized x coordinate (0-1000) to actual pixel coordinate."""
    return int(x / 1000 * screen_width)

def denormalize_y(y: int, screen_height: int) -> int:
    """Convert normalized y coordinate (0-1000) to actual pixel coordinate."""
    return int(y / 1000 * screen_height)

def execute_function_calls(interaction, page, screen_width, screen_height):
    results = []
    function_calls = []

    # Parse content parts (Handling legacy and Gemini 3 response structures)
    parts = candidate.content.parts if hasattr(candidate, 'content') else []
    if not parts and hasattr(candidate, 'function_calls'):
        function_calls = candidate.function_calls
    else:
        for part in parts:
            if part.function_call:
                function_calls.append(part.function_call)

    for function_call in function_calls:
        action_result = {}
        fname = function_call.name
        args = function_call.args
        print(f"  -> Executing: {fname} (Intent: {args.get('intent', 'N/A')})")

        try:
            if fname in ("open_web_browser", "open_app"):
                pass # Handled / already open
            elif fname in ("click", "click_at", "double_click", "triple_click", "middle_click", "right_click", "move", "long_press"):
                actual_x = denormalize_x(args["x"], screen_width)
                actual_y = denormalize_y(args["y"], screen_height)

                if fname in ("click", "click_at"):
                    page.mouse.click(actual_x, actual_y)
                elif fname == "double_click":
                    page.mouse.dblclick(actual_x, actual_y)
                elif fname == "right_click":
                    page.mouse.click(actual_x, actual_y, button="right")
                elif fname == "middle_click":
                    page.mouse.click(actual_x, actual_y, button="middle")
                elif fname == "move":
                    page.mouse.move(actual_x, actual_y)
            elif fname in ("type", "type_text_at"):
                actual_x = denormalize_x(args["x"], screen_width) if "x" in args else None
                actual_y = denormalize_y(args["y"], screen_height) if "y" in args else None
                text = args["text"]
                press_enter = args.get("press_enter", False)

                if actual_x is not None and actual_y is not None:
                    page.mouse.click(actual_x, actual_y)
                # Clear field first
                page.keyboard.press("Meta+A")
                page.keyboard.press("Backspace")
                page.keyboard.type(text)
                if press_enter:
                    page.keyboard.press("Enter")
            elif fname == "navigate":
                page.goto(args["url"])
            elif fname == "go_back":
                page.go_back()
            elif fname == "go_forward":
                page.go_forward()
            elif fname == "wait":
                time.sleep(args.get("seconds", 1))
            else:
                print(f"Warning: Custom or unhandled function {fname}")

            page.wait_for_load_state(timeout=5000)
            time.sleep(1)

        except Exception as e:
            print(f"Error executing {fname}: {e}")
            action_result = {"error": str(e)}

        results.append((fname, function_call.id, action_result))

    return results

JavaScript

function denormalizeX(x, screenWidth) {
    // Convert normalized x coordinate (0-1000) to actual pixel coordinate.
    return Math.floor((x / 1000) * screenWidth);
}

function denormalizeY(y, screenHeight) {
    // Convert normalized y coordinate (0-1000) to actual pixel coordinate.
    return Math.floor((y / 1000) * screenHeight);
}

async function executeFunctionCalls(candidate, page, screenWidth, screenHeight) {
    const results = [];
    let functionCalls = [];

    // Parse function calls from candidate response
    const parts = candidate.content?.parts || [];
    if (parts.length === 0 && candidate.functionCalls) {
        functionCalls = candidate.functionCalls;
    } else {
        for (const part of parts) {
            if (part.functionCall) {
                functionCalls.push(part.functionCall);
            }
        }
    }

    for (const functionCall of functionCalls) {
        const actionResult = {};
        const fname = functionCall.name;
        const args = functionCall.args;
        console.log(`  -> Executing: ${fname} (Intent: ${args.intent || 'N/A'})`);

        try {
            if (fname === "open_web_browser" || fname === "open_app") {
                // Handled / already open
            } else if (["click", "click_at", "double_click", "triple_click", "middle_click", "right_click", "move", "long_press"].includes(fname)) {
                const actualX = denormalizeX(args.x, screenWidth);
                const actualY = denormalizeY(args.y, screenHeight);

                if (fname === "click" || fname === "click_at") {
                    await page.mouse.click(actualX, actualY);
                } else if (fname === "double_click") {
                    await page.mouse.dblclick(actualX, actualY);
                } else if (fname === "right_click") {
                    await page.mouse.click(actualX, actualY, { button: "right" });
                } else if (fname === "middle_click") {
                    await page.mouse.click(actualX, actualY, { button: "middle" });
                } else if (fname === "move") {
                    await page.mouse.move(actualX, actualY);
                }
            } else if (fname === "type" || fname === "type_text_at") {
                const actualX = args.x !== undefined ? denormalizeX(args.x, screenWidth) : null;
                const actualY = args.y !== undefined ? denormalizeY(args.y, screenHeight) : null;
                const text = args.text;
                const pressEnter = args.press_enter || false;

                if (actualX !== null && actualY !== null) {
                    await page.mouse.click(actualX, actualY);
                }
                // Clear field first
                await page.keyboard.press("Meta+A");
                await page.keyboard.press("Backspace");
                await page.keyboard.type(text);
                if (pressEnter) {
                    await page.keyboard.press("Enter");
                }
            } else if (fname === "navigate") {
                await page.goto(args.url);
            } else if (fname === "go_back") {
                await page.goBack();
            } else if (fname === "go_forward") {
                await page.goForward();
            } else if (fname === "wait") {
                await new Promise(resolve => setTimeout(resolve, (args.seconds || 1) * 1000));
            } else {
                console.log(`Warning: Custom or unhandled function ${fname}`);
            }

            await page.waitForLoadState('load', { timeout: 5000 }).catch(() => {});
            await new Promise(resolve => setTimeout(resolve, 1000));
        } catch (e) {
            console.log(`Error executing ${fname}: ${e}`);
            actionResult.error = e.message;
        }

        results.push([fname, functionCall.id, actionResult]);
    }

    return results;
}

4. Capture the new environment state

Capture a screen representation and return it to the model.

Python

def get_function_responses(page, results):
    screenshot_bytes = page.screenshot(type="png")
    current_url = page.url
    function_responses = []
    for name, call_id, result in results:
        function_responses.append({
            "type": "function_result",
            "name": name,
            "call_id": call_id,
            "result": [
                {
                    "type": "text",
                    "text": json.dumps({"url": current_url, **result})
                },
                {
                    "type": "image",
                    "data": base64.b64encode(screenshot_bytes).decode("utf-8"),
                    "mime_type": "image/png"
                }
            ]
        })
    return function_responses

JavaScript

async function getFunctionResponses(page, results) {
    const screenshotBuffer = await page.screenshot({ type: 'png' });
    const screenshotBase64 = screenshotBuffer.toString('base64');
    const currentUrl = page.url();
    const functionResponses = [];

    for (const [name, callId, result] of results) {
        functionResponses.push({
            type: "function_result",
            name: name,
            call_id: callId,
            result: [
                {
                    type: "text",
                    text: JSON.stringify({ url: currentUrl, ...result })
                },
                {
                    type: "image",
                    data: screenshotBase64,
                    mime_type: "image/png"
                }
            ]
        });
    }
    return functionResponses;
}

Once you have defined how to capture and format the environment state, you can combine all these steps into a continuous execution loop.

Build an agent loop

To enable multi-step interactions, combine the four steps from the How to implement Computer Use section into a single loop. This loop continues requesting actions and feeding the results back to the model until the task is complete.

Remember to manage the conversation history correctly by appending both the model responses and your function responses to the history at each step.

Python

import time
from typing import Any, List, Tuple
from playwright.sync_api import sync_playwright
from google import genai
from google.genai import types

client = genai.Client()

SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900

print("Initializing browser...")
playwright = sync_playwright().start()
browser = playwright.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT})
page = context.new_page()

# Paste helper functions execute_function_calls and get_function_responses here

try:
    page.goto("https://ai.google.dev/gemini-api/docs")

    config = types.GenerateContentConfig(
        tools=[types.Tool(computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER,
            enable_prompt_injection_detection=True
        ))],
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    )

    initial_screenshot = page.screenshot(type="png")
    USER_PROMPT = "Go to ai.google.dev/gemini-api/docs and search for pricing."
    print(f"Goal: {USER_PROMPT}")

    contents = [
        types.Content(role="user", parts=[
            types.Part(text=USER_PROMPT),
            types.Part.from_bytes(data=initial_screenshot, mime_type='image/png')
        ])
    ]

    # Agent Loop
    turn_limit = 5
    for i in range(turn_limit):
        print(f"\n--- Turn {i+1} ---")
        print("Thinking...")
        response = client.models.generate_content(
            model='gemini-3.5-flash',
            contents=contents,
            config=config,
        )

        candidate = response.candidates[0]
        contents.append(candidate.content)

        has_function_calls = any(part.function_call for part in candidate.content.parts)
        if not has_function_calls:
            text_response = " ".join(
                part.text for part in candidate.content.parts if hasattr(part, 'text')
            )
            print("Agent finished:", text_response)
            break

        print("Executing actions...")
        results = execute_function_calls(candidate, page, SCREEN_WIDTH, SCREEN_HEIGHT)

        print("Capturing state...")
        function_responses = get_function_responses(page, results)

        contents.append(
            types.Content(role="user", parts=[types.Part(function_response=fr) for fr in function_responses])
        )

finally:
    print("Closing browser...")
    browser.close()
    playwright.stop()

JavaScript

import { chromium } from 'playwright';
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

// Constants for screen dimensions
const SCREEN_WIDTH = 1440;
const SCREEN_HEIGHT = 900;

console.log("Initializing browser...");
const browser = await chromium.launch({ headless: false });
const context = await browser.newContext({
    viewport: { width: SCREEN_WIDTH, height: SCREEN_HEIGHT }
});
const page = await context.newPage();

// Define helper functions. Copy/paste from steps 3 and 4:
// function denormalizeX(...)
// function denormalizeY(...)
// async function executeFunctionCalls(...)
// async function getFunctionResponses(...)

try {
    await page.goto("https://ai.google.dev/gemini-api/docs");

    const config = {
        tools: [{
            computerUse: {
                environment: "ENVIRONMENT_BROWSER",
                enable_prompt_injection_detection: true
            }
        }],
        thinkingConfig: { includeThoughts: true }
    };

    const initialScreenshotBuffer = await page.screenshot({ type: 'png' });
    const initialScreenshotBase64 = initialScreenshotBuffer.toString('base64');
    const USER_PROMPT = "Go to ai.google.dev/gemini-api/docs and search for pricing.";
    console.log(`Goal: ${USER_PROMPT}`);

    const contents = [
        {
            role: "user",
            parts: [
                { text: USER_PROMPT },
                {
                    inlineData: {
                        data: initialScreenshotBase64,
                        mimeType: "image/png"
                    }
                }
            ]
        }
    ];

    // Agent Loop
    const turnLimit = 5;
    for (let i = 0; i < turnLimit; i++) {
        console.log(`\n--- Turn ${i + 1} ---`);
        console.log("Thinking...");
        const response = await ai.models.generateContent({
            model: 'gemini-3.5-flash',
            contents: contents,
            config: config
        });

        const candidate = response.candidates[0];
        contents.push(candidate.content);

        const hasFunctionCalls = candidate.content.parts.some(part => part.functionCall);
        if (!hasFunctionCalls) {
            const textResponse = candidate.content.parts
                .filter(part => part.text)
                .map(part => part.text)
                .join(" ");
            console.log("Agent finished:", textResponse);
            break;
        }

        console.log("Executing actions...");
        const results = await executeFunctionCalls(candidate, page, SCREEN_WIDTH, SCREEN_HEIGHT);

        console.log("Capturing state...");
        const functionResponses = await getFunctionResponses(page, results);

        contents.push({
            role: "user",
            parts: functionResponses.map(fr => ({
                ...fr
            }))
        });
    }
} finally {
    console.log("Closing browser...");
    await browser.close();
}

Supported environments (Gemini 3.5 Flash)

Gemini 3.5 Flash supports three environments specified in the computer_use configurations:

Browser environment (ENVIRONMENT_BROWSER)

Action actions under browser tool:

Command name Description Arguments (in function call)
click Left clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
double_click Double clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
triple_click Triple clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
middle_click Middle clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
right_click Right clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
mouse_down Presses and holds the mouse button at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
mouse_up Releases the mouse button at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
move Moves the cursor to the specified position. y: int (0-999)
x: int (0-999)
intent: str
type Types text. text: str
press_enter: bool (Optional, default false)
intent: str
drag_and_drop Drags an item from the start coordinate to the end coordinate. start_y: int (0-999)
start_x: int (0-999)
end_y: int (0-999)
end_x: int (0-999)
intent: str
wait Pauses execution for a specified number of seconds. seconds: int (Optional, default 1)
intent: str
press_key Presses the specified key and releases it. key: str
intent: str
key_down Presses and holds the specified key. key: str
intent: str
key_up Releases the specified key. key: str
intent: str
hotkey Presses the specified key combination. keys: List[str]
intent: str
take_screenshot Returns a screenshot of the current screen. intent: str
scroll Scrolls up, down, left, or right at a coordinate by a pixel distance. y: int (0-999)
x: int (0-999)
direction: str ("up", "down", "left", "right")
magnitude_in_pixels: int (0-999, Optional, default 300)
intent: str
go_back Navigates back to the previous webpage in the browser history. intent: str
navigate Navigates directly to a specified URL. url: str
intent: str
go_forward Navigates forward to the next webpage in the browser history. intent: str

Mobile environment (ENVIRONMENT_MOBILE)

Android-optimized environment actions:

Command name Description Arguments (in function call)
open_app Opens an application by its name. app_name: str
intent: str
click Left clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
list_apps Lists available applications on the device, returning their names and package names. intent: str
wait Pauses execution for a specified number of seconds. seconds: int (Optional, default 1)
intent: str
go_back Navigates back to the previous screen or webpage. intent: str
type Types text. text: str
press_enter: bool (Optional, default false)
intent: str
drag_and_drop Drags an item from the start coordinate to the end coordinate. start_y: int (0-999)
start_x: int (0-999)
end_y: int (0-999)
end_x: int (0-999)
intent: str
long_press Performs a long press at a coordinate on the screen. y: int (0-999)
x: int (0-999)
seconds: int (Optional, default 2)
intent: str
press_key Presses the specified key and releases it. key: str
intent: str
take_screenshot Returns a screenshot of the current screen. intent: str

Desktop environment (ENVIRONMENT_DESKTOP)

Desktop environments OS-level cursor commands:

Command name Description Arguments (in function call)
click Left clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
double_click Double clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
triple_click Triple clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
middle_click Middle clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
right_click Right clicks at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
mouse_down Presses and holds the mouse button at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
mouse_up Releases the mouse button at the coordinate. y: int (0-999)
x: int (0-999)
intent: str
move Moves the cursor to the specified position. y: int (0-999)
x: int (0-999)
intent: str
type Types text. text: str
press_enter: bool (Optional, default false)
intent: str
drag_and_drop Drags an item from the start coordinate to the end coordinate. start_y: int (0-999)
start_x: int (0-999)
end_y: int (0-999)
end_x: int (0-999)
intent: str
wait Pauses execution for a specified number of seconds. seconds: int (Optional, default 1)
intent: str
press_key Presses the specified key and releases it. key: str
intent: str
key_down Presses and holds the specified key. key: str
intent: str
key_up Releases the specified key. key: str
intent: str
hotkey Presses the specified key combination. keys: List[str]
intent: str
take_screenshot Returns a screenshot of the current screen. intent: str
scroll Scrolls up, down, left, or right at a coordinate by a pixel distance. y: int (0-999)
x: int (0-999)
direction: str ("up", "down", "left", "right")
magnitude_in_pixels: int (0-999, Optional, default 300)
intent: str

Legacy Supported UI actions (Gemini 2.5)

For legacy models (gemini-2.5-computer-use-preview-10-2025), the following actions are supported:

Command name Description Arguments (in function call) Example function call
open_web_browser Opens the web browser. None {"name": "open_web_browser", "args": {}}
wait_5_seconds Pauses execution for 5 seconds. None {"name": "wait_5_seconds", "args": {}}
go_back Navigates to the previous page in history. None {"name": "go_back", "args": {}}
go_forward Navigates to the next page in history. None {"name": "go_forward", "args": {}}
search Navigates to default search engine. None {"name": "search", "args": {}}
navigate Navigates the browser directly to the specified URL. url: str {"name": "navigate", "args": {"url": "https://www.wikipedia.org"}}
click_at Clicks at a specific coordinate. y: int (0-999), x: int (0-999) {"name": "click_at", "args": {"y": 300, "x": 500}}
hover_at Hovers mouse at a specific coordinate. y: int (0-999), x: int (0-999) {"name": "hover_at", "args": {"y": 150, "x": 250}}
type_text_at Types text at a coordinate. y: int (0-999), x: int (0-999), text: str, press_enter: bool (Optional, default True), clear_before_typing: bool (Optional, default True) {"name": "type_text_at", "args": {"y": 250, "x": 400, "text": "search", "press_enter": false}}
key_combination Press keys or combinations. keys: str {"name": "key_combination", "args": {"keys": "Control+A"}}
scroll_document Scrolls the entire webpage. direction: str {"name": "scroll_document", "args": {"direction": "down"}}
scroll_at Scrolls at coordinate (x,y). y: int, x: int, direction: str, magnitude: int (Optional, default 800) {"name": "scroll_at", "args": {"y": 500, "x": 500, "direction": "down"}}
drag_and_drop Drags between two coordinates. y: int, x: int, destination_y: int, destination_x: int {"name": "drag_and_drop", "args": {"y": 100, "destination_y": 500, "destination_x": 500, "x": 100}}

Custom user-defined functions

You can extend the functionality of the model by including custom user-defined functions. For example, in human-in-the-loop (HITL) scenarios you can exclude default predefined actions and register custom actions.

Gemini 3.5 Flash Custom Tooling

Python

Exclude standard predefined browser actions (such as click) and register a custom yield_to_user tool:

from google import genai
from google.genai import types

client = genai.Client()

yield_to_user_tool = types.FunctionDeclaration(
    name="yield_to_user",
    description="Yields control back to the user for assistance or verification when an automated action is unsafe or ambiguous.",
    parameters=types.Schema(
        type="OBJECT",
        properties={
            "reason": types.Schema(
                type="STRING",
                description="The reason why the agent is yielding control to the human."
            )
        },
        required=["reason"]
    )
)

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Click the submit button. If you need a second factor authentication code, ask me.",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment="ENVIRONMENT_MOBILE",
                    excluded_predefined_functions=["click"]
                )
            ),
            yield_to_user_tool
        ]
    )
)

Gemini 2.5 (Legacy) Custom Tooling

Python

from typing import Optional, Dict, Any
from google import genai
from google.genai import types

client = genai.Client()

# Define custom tools here
custom_functions = [...] # Describe parameters as FunctionDeclaration object

def make_generate_content_config():
    excluded_functions = ["open_web_browser", "wait_5_seconds", "go_back", "go_forward", "search", "navigate", "hover_at", "scroll_document", "key_combination", "drag_and_drop"]
    generate_content_config = types.GenerateContentConfig(
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                    excluded_predefined_functions=excluded_functions
                )
            ),
            types.Tool(function_declarations=custom_functions)
        ]
    )
    return generate_content_config

Managing thinking levels (Gemini 3.5 Flash)

For computer use agents, you can configure different thinking levels to balance action quality and execution speed. Lower thinking levels generally achieve a good balance for standard automation tasks.

Safety and security

Configuring safety policies (Gemini 3.5 Flash)

The Gemini 3.5 Flash model includes a built-in safety service categories that automatically determine if user confirmation is required.

Safety policy category Description
FINANCIAL_TRANSACTIONS Blocks or triggers confirmation for actions involving payments, retail checkout, or regulated goods.
SENSITIVE_DATA_MODIFICATION Protects health, financial, or government records from unauthorized modification.
COMMUNICATION_TOOL Restricts the agent from autonomously sending emails, chat messages, or drafts.
ACCOUNT_CREATION Restricts the agent from autonomously registering new accounts on websites.
DATA_MODIFICATION Regulates overall file system modifications, data sharing, and storage deletion.
USER_CONSENT_MANAGEMENT Requires user takeover for cookie consent banners and privacy prompts.
LEGAL_TERMS_AND_AGREEMENTS Prevents the model from autonomously accepting Terms of Service or legally binding contracts.

Safety overrides

You can override select policies by passing overrides:

Python

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Clean up the local folder by archiving old logs.",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_DESKTOP,
                    safety_policy_overrides=[
                        types.SafetyPolicyOverride(category="DATA_MODIFICATION")
                    ]
                )
            )
        ]
    )
)

JavaScript

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI();

const response = await ai.models.generateContent({
  model: 'gemini-3.5-flash',
  contents: "Clean up the local folder by archiving old logs.",
  config: {
    tools: [{
      computerUse: {
        environment: "ENVIRONMENT_DESKTOP",
        safety_policy_overrides: [
          { category: "DATA_MODIFICATION" }
        ]
      }
    }]
  }
});

Prompt injection detection (Gemini 3.5 Flash)

Opt-in safety mechanism that scans screenshot pixels for hidden adversarial prompt instructions (e.g. "Ignore previous commands") and blocks execution when detected.

Acknowledge safety decision (Gemini 2.5 Legacy)

For legacy models, the response may include a safety_decision parameter:

{
  "function_call": {
    "name": "click_at",
    "args": {
      "x": 60,
      "y": 100,
      "safety_decision": {
        "explanation": "Must check check-box",
        "decision": "require_confirmation"
      }
    }
  }
}

If the safety_decision is require_confirmation, prompt the end user. If the user confirms, set safety_acknowledgement in FunctionResponse.

Python

def get_safety_confirmation(safety_decision):
    # Prompt user
    return "CONTINUE" # Or TERMINATE

# Inside execute_function_calls:
if 'safety_decision' in function_call.args:
    decision = get_safety_confirmation(function_call.args['safety_decision'])
    if decision == "TERMINATE":
        break
    extra_fr_fields["safety_acknowledgement"] = "true"

Safety best practices

Computer Use presents unique security and operational risks, as a model acting on a user's behalf might encounter untrusted content on screens or make errors in executing actions. Implement the following best practices to protect user data and systems:

  1. Human-in-the-Loop (HITL):

    • Enforce user confirmation: When the safety response indicates require_confirmation (or legacy safety decision requires it), prompt the user for approval.
    • Provide custom safety instructions: Implement a custom system instruction to define and enforce your own safety boundaries. For example:

      Python

      from google import genai
      from google.genai import types
      
      system_instruction = """
      ## **RULE 1: Seek User Confirmation (USER_CONFIRMATION)**
      
      This is your first and most important check. If the next required action falls
      into any of the following categories, you MUST stop immediately, and seek the
      user's explicit permission.
      
      **Procedure for Seeking Confirmation:**
      * **For Consequential Actions:** Perform all preparatory steps (e.g., navigating,
        filling out forms, typing a message). You will ask for confirmation **AFTER**
        all necessary information is entered on the screen, but **BEFORE** you perform
        the final, irreversible action (e.g., before clicking "Send", "Submit",
        "Confirm Purchase", "Share").
      * **For Prohibited Actions:** If the action is strictly forbidden (e.g., accepting
        legal terms, solving a CAPTCHA), you must first inform the user about the
        required action and ask for their confirmation to proceed.
      
      **USER_CONFIRMATION Categories:**
      
      *   **Consent and Agreements:** You are FORBIDDEN from accepting, selecting, or
          agreeing to any of the following on the user's behalf. You must ask the
          user to confirm before performing these actions.
          *   Terms of Service
          *   Privacy Policies
          *   Cookie consent banners
          *   End User License Agreements (EULAs)
          *   Any other legally significant contracts or agreements.
      *   **Robot Detection:** You MUST NEVER attempt to solve or bypass the
          following. You must ask the user to confirm before performing these actions.
          *   CAPTCHAs (of any kind)
          *   Any other anti-robot or human-verification mechanisms, even if you are
              capable.
      *   **Financial Transactions:**
          *   Completing any purchase.
          *   Managing or moving money (e.g., transfers, payments).
          *   Purchasing regulated goods or participating in gambling.
      *   **Sending Communications:**
          *   Sending emails.
          *   Sending messages on any platform (e.g., social media, chat apps).
          *   Posting content on social media or forums.
      *   **Accessing or Modifying Sensitive Information:**
          *   Health, financial, or government records (e.g., medical history, tax
              forms, passport status).
          *   Revealing or modifying sensitive personal identifiers (e.g., SSN, bank
              account number, credit card number).
      *   **User Data Management:**
          *   Accessing, downloading, or saving files from the web.
          *   Sharing or sending files/data to any third party.
          *   Transferring user data between systems.
      *   **Browser Data Usage:**
          *   Accessing or managing Chrome browsing history, bookmarks, autofill data,
              or saved passwords.
      *   **Security and Identity:**
          *   Logging into any user account.
          *   Any action that involves misrepresentation or impersonation (e.g.,
              creating a fan account, posting as someone else).
      *   **Insurmountable Obstacles:** If you are technically unable to interact with
          a user interface element or are stuck in a loop you cannot resolve, ask the
          user to take over.
      ---
      
      ## **RULE 2: Default Behavior (ACTUATE)**
      
      If an action does **NOT** fall under the conditions for `USER_CONFIRMATION`,
      your default behavior is to **Actuate**.
      
      **Actuation Means:**  You MUST proactively perform all necessary steps to move
      the user's request forward. Continue to actuate until you either complete the
      non-consequential task or encounter a condition defined in Rule 1.
      
      *   **Example 1:** If asked to send money, you will navigate to the payment
          portal, enter the recipient's details, and enter the amount. You will then
          **STOP** as per Rule 1 and ask for confirmation before clicking the final
          "Send" button.
      *   **Example 2:** If asked to post a message, you will navigate to the site,
          open the post composition window, and write the full message. You will then
          **STOP** as per Rule 1 and ask for confirmation before clicking the final
          "Post" button.
      
          After the user has confirmed, remember to get the user's latest screen
          before continuing to perform actions.
      
      # Final Response Guidelines:
      Write final response to the user in the following cases:
      - User confirmation
      - When the task is complete or you have enough information to respond to the user
      """
      
      client = genai.Client()
      response = client.models.generate_content(
          model="gemini-3.5-flash",
          contents="Prepare a draft but do not send.",
          config=types.GenerateContentConfig(
              system_instruction=system_instruction,
              tools=[types.Tool(computer_use=types.ComputerUse(environment="ENVIRONMENT_BROWSER"))]
          )
      )
      

      JavaScript

      import { GoogleGenAI } from '@google/genai';
      
      const ai = new GoogleGenAI();
      
      const systemInstruction = `
      ## **RULE 1: Seek User Confirmation (USER_CONFIRMATION)**
      
      This is your first and most important check. If the next required action falls
      into any of the following categories, you MUST stop immediately, and seek the
      user's explicit permission.
      
      **Procedure for Seeking Confirmation:**
      * **For Consequential Actions:** Perform all preparatory steps (e.g., navigating,
        filling out forms, typing a message). You will ask for confirmation **AFTER**
        all necessary information is entered on the screen, but **BEFORE** you perform
        the final, irreversible action (e.g., before clicking "Send", "Submit",
        "Confirm Purchase", "Share").
      * **For Prohibited Actions:** If the action is strictly forbidden (e.g., accepting
        legal terms, solving a CAPTCHA), you must first inform the user about the
        required action and ask for their confirmation to proceed.
      
      **USER_CONFIRMATION Categories:**
      
      *   **Consent and Agreements:** You are FORBIDDEN from accepting, selecting, or
          agreeing to any of the following on the user's behalf. You must ask the
          user to confirm before performing these actions.
          *   Terms of Service
          *   Privacy Policies
          *   Cookie consent banners
          *   End User License Agreements (EULAs)
          *   Any other legally significant contracts or agreements.
      *   **Robot Detection:** You MUST NEVER attempt to solve or bypass the
          following. You must ask the user to confirm before performing these actions.
          *   CAPTCHAs (of any kind)
          *   Any other anti-robot or human-verification mechanisms, even if you are
              capable.
      *   **Financial Transactions:**
          *   Compleying any purchase.
          *   Managing or moving money (e.g., transfers, payments).
          *   Purchasing regulated goods or participating in gambling.
      *   **Sending Communications:**
          *   Sending emails.
          *   Sending messages on any platform (e.g., social media, chat apps).
          *   Posting content on social media or forums.
      *   **Accessing or Modifying Sensitive Information:**
          *   Health, financial, or government records (e.g., medical history, tax
              forms, passport status).
          *   Revealing or modifying sensitive personal identifiers (e.g., SSN, bank
              account number, credit card number).
      *   **User Data Management:**
          *   Accessing, downloading, or saving files from the web.
          *   Sharing or sending files/data to any third party.
          *   Transferring user data between systems.
      *   **Browser Data Usage:**
          *   Accessing or managing Chrome browsing history, bookmarks, autofill data,
              or saved passwords.
      *   **Security and Identity:**
          *   Logging into any user account.
          *   Any action that involves misrepresentation or impersonation (e.g.,
              creating a fan account, posting as someone else).
      *   **Insurmountable Obstacles:** If you are technically unable to interact with
          a user interface element or are stuck in a loop you cannot resolve, ask the
          user to take over.
      ---
      
      ## **RULE 2: Default Behavior (ACTUATE)**
      
      If an action does **NOT** fall under the conditions for `USER_CONFIRMATION`,
      your default behavior is to **Actuate**.
      
      **Actuation Means:**  You MUST proactively perform all necessary steps to move
      the user's request forward. Continue to actuate until you either complete the
      non-consequential task or encounter a condition defined in Rule 1.
      
      *   **Example 1:** If asked to send money, you will navigate to the payment
          portal, enter the recipient's details, and enter the amount. You will then
          **STOP** as per Rule 1 and ask for confirmation before clicking the final
          "Send" button.
      *   **Example 2:** If asked to post a message, you will navigate to the site,
          open the post composition window, and write the full message. You will then
          **STOP** as per Rule 1 and ask for confirmation before clicking the final
          "Post" button.
      
          After the user has confirmed, remember to get the user's latest screen
          before continuing to perform actions.
      
      # Final Response Guidelines:
      Write final response to the user in the following cases:
      - User confirmation
      - When the task is complete or you have enough information to respond to the user
      `;
      
      const response = await ai.models.generateContent({
        model: 'gemini-3.5-flash',
        contents: "Prepare a draft but do not send.",
        config: {
          systemInstruction: systemInstruction,
          tools: [{
            computerUse: {
              environment: "ENVIRONMENT_BROWSER"
            }
          }]
        }
      });
      
  2. Secure execution environment: Run your agent in a secure, sandboxed environment to limit its potential impact. This can be a sandboxed virtual machine (VM), a container (e.g., Docker), or a dedicated browser profile with limited permissions. See the GitHub reference implementation for sandbox setup guidance using Docker.

  3. Input sanitization: Sanitize all user-generated text in prompts to mitigate the risk of unintended instructions or prompt injection. This is a helpful layer of security, but not a replacement for a secure execution environment.

  4. Content guardrails: Use guardrails and content safety APIs to evaluate user inputs, tool inputs and outputs, and the agent's responses for appropriateness, prompt injection, and jailbreak detection.

  5. Allowlists and blocklists: Implement filtering mechanisms to control where the model can navigate and what it can do. A blocklist of prohibited websites is a good starting point, while a more restrictive allowlist is even more secure.

  6. Observability and logging: Maintain detailed logs for debugging, auditing, and incident response. Your client should log prompts, screenshots, model-suggested actions (function_call), safety responses, and all actions ultimately executed by the client.

  7. Environment management: Ensure the GUI environment is consistent. Unexpected pop-ups, notifications, or changes in layout can confuse the model. Start from a known, clean state for each new task if possible.

Model versions

You can use Computer Use with the following models:

  • Gemini 3.5 Flash (gemini-3.5-flash): The recommended model for computer use, featuring streamlined actions with intents, support for browser, mobile, and desktop environments, configurable safety policies, and prompt injection detection.
  • Gemini 3 Flash Preview (gemini-3-flash-preview): Preview model supporting computer use.
  • Gemini 2.5 (Legacy Preview) (gemini-2.5-computer-use-preview-10-2025): Legacy preview model optimized for browser-based computer use.

What's next