การใช้คอมพิวเตอร์

การใช้คอมพิวเตอร์ช่วยให้คุณสร้างเอเจนต์ควบคุมเบราว์เซอร์ที่โต้ตอบ และทำงานโดยอัตโนมัติได้ การใช้ภาพหน้าจอทำให้โมเดล "เห็น" หน้าจอคอมพิวเตอร์และ "ดำเนินการ" โดยสร้างการดำเนินการ UI ที่เฉพาะเจาะจง เช่น การคลิกเมาส์และอินพุตจากคีย์บอร์ด คุณต้องเขียนโค้ดแอปพลิเคชันฝั่งไคลเอ็นต์เพื่อรับและดำเนินการกับการดำเนินการเกี่ยวกับการใช้คอมพิวเตอร์ ซึ่งคล้ายกับการเรียกใช้ฟังก์ชัน

การใช้คอมพิวเตอร์ช่วยให้คุณสร้างเอเจนต์ที่ทำสิ่งต่อไปนี้ได้

ทำให้การกรอกข้อมูลหรือการกรอกแบบฟอร์มซ้ำๆ ในเว็บไซต์เป็นแบบอัตโนมัติ
ทำการทดสอบเว็บแอปพลิเคชันและโฟลว์ของผู้ใช้โดยอัตโนมัติ
ทําการวิจัยในเว็บไซต์ต่างๆ (เช่น รวบรวมข้อมูลผลิตภัณฑ์ ราคา และรีวิวจากเว็บไซต์อีคอมเมิร์ซเพื่อประกอบการตัดสินใจซื้อ)

วิธีที่ง่ายที่สุดในการทดสอบความสามารถในการใช้คอมพิวเตอร์คือการทดสอบผ่านการติดตั้งใช้งานอ้างอิงหรือสภาพแวดล้อมเดโมของ Browserbase

หมายเหตุ: การใช้งานคอมพิวเตอร์อาจมีแนวโน้มที่จะเกิดข้อผิดพลาดและช่องโหว่ด้านความปลอดภัย เนื่องจากเป็นความสามารถในเวอร์ชันตัวอย่าง เราขอแนะนำให้ควบคุมดูแลอย่างใกล้ชิดสำหรับงานที่สำคัญ และหลีกเลี่ยงการใช้ความสามารถในการใช้คอมพิวเตอร์สำหรับงานที่ต้องตัดสินใจเรื่องสำคัญ ข้อมูลที่ละเอียดอ่อน หรือการดำเนินการที่ไม่สามารถแก้ไขข้อผิดพลาดร้ายแรงได้ เราขอแนะนำให้คุณอ่านแนวทางปฏิบัติ แนะนำด้านความปลอดภัย นโยบายการใช้งาน ที่ไม่อนุญาต และข้อกำหนดในการให้บริการเพิ่มเติมของ Gemini API

การทำงานของการใช้คอมพิวเตอร์

หากต้องการสร้างเอเจนต์ควบคุมเบราว์เซอร์ด้วยโมเดลการใช้คอมพิวเตอร์ ให้ใช้ ลูปเอเจนต์ที่ทำสิ่งต่อไปนี้

ส่งคำขอไปยังโมเดล
- เพิ่มเครื่องมือการใช้คอมพิวเตอร์และฟังก์ชันที่ผู้ใช้กำหนดเองหรือฟังก์ชันที่ยกเว้น (ไม่บังคับ) ลงในคำขอ API
- ป้อนพรอมต์โมเดลการใช้คอมพิวเตอร์ด้วยคำขอของผู้ใช้
รับคำตอบของโมเดล
- โมเดลการใช้คอมพิวเตอร์จะวิเคราะห์คำขอและภาพหน้าจอของผู้ใช้ แล้ว สร้างคำตอบซึ่งมีfunction_callที่แนะนำ ซึ่งแสดงถึงการดำเนินการใน UI (เช่น "คลิกที่พิกัด (x,y)" หรือ "พิมพ์ 'ข้อความ'") ดูคำอธิบายการดำเนินการทั้งหมดใน UI ที่โมเดล Computer Use รองรับได้ที่การดำเนินการที่รองรับ
- การตอบกลับจาก API อาจมี safety_decision จากระบบความปลอดภัยภายใน ที่ตรวจสอบการดำเนินการที่โมเดลเสนอด้วย ซึ่ง safety_decisionจะจัดประเภทการดำเนินการเป็น
  - ปกติ / อนุญาต: ระบบจะถือว่าการดำเนินการปลอดภัย นอกจากนี้ ยังอาจ แสดงด้วยการไม่มีsafety_decision
  - ต้องมีการยืนยัน (require_confirmation): โมเดลกำลังจะดำเนินการ ที่อาจมีความเสี่ยง (เช่น การคลิก "แบนเนอร์คุกกี้ยอมรับ")
ดำเนินการตามการกระทำที่ได้รับ
- โค้ดฝั่งไคลเอ็นต์จะได้รับ function_call และ safety_decision ที่เกี่ยวข้อง
  - ปกติ / อนุญาต: หาก safety_decision ระบุว่าปกติ/อนุญาต (หรือหากไม่มี safety_decision) โค้ดฝั่งไคลเอ็นต์จะเรียกใช้ function_call ที่ระบุในสภาพแวดล้อมเป้าหมายได้ (เช่น เว็บเบราว์เซอร์)
  - ต้องมีการยืนยัน: หาก safety_decision ระบุว่า ต้องมีการยืนยัน แอปพลิเคชันของคุณต้องแจ้งให้ผู้ใช้ปลายทาง ยืนยันก่อนที่จะดำเนินการ function_call หากผู้ใช้ ยืนยัน ให้ดำเนินการตามที่ขอ หากผู้ใช้ปฏิเสธ ให้ อย่าดำเนินการ
บันทึกสถานะสภาพแวดล้อมใหม่
- หากดำเนินการแล้ว ไคลเอ็นต์จะจับภาพหน้าจอใหม่ของ GUI และ URL ปัจจุบันเพื่อส่งกลับไปยังโมเดลการใช้คอมพิวเตอร์เป็นส่วนหนึ่งของ function_response
- หากระบบความปลอดภัยบล็อกการดำเนินการหรือผู้ใช้ปฏิเสธการยืนยัน แอปพลิเคชันของคุณอาจส่งความคิดเห็นในรูปแบบอื่นไปยังโมเดลหรือสิ้นสุดการโต้ตอบ

กระบวนการนี้จะทำซ้ำจากขั้นตอนที่ 2 โดยโมเดลจะใช้ภาพหน้าจอใหม่ และเป้าหมายที่กำลังดำเนินการเพื่อแนะนำการดำเนินการถัดไป ลูปจะทำงานต่อไป จนกว่าจะทำงานเสร็จ เกิดข้อผิดพลาด หรือกระบวนการสิ้นสุดลง (เช่น เนื่องจากมีการตอบสนองด้านความปลอดภัยแบบ "บล็อก" หรือผู้ใช้ตัดสินใจ)

ภาพรวม
การใช้คอมพิวเตอร์

วิธีใช้งานการใช้คอมพิวเตอร์

ก่อนที่จะสร้างด้วยเครื่องมือการใช้งานคอมพิวเตอร์ คุณจะต้องตั้งค่าสิ่งต่อไปนี้

สภาพแวดล้อมการดำเนินการที่ปลอดภัย: เพื่อความปลอดภัย คุณควรเรียกใช้เอเจนต์การใช้งานคอมพิวเตอร์ในสภาพแวดล้อมที่ปลอดภัยและมีการควบคุม (เช่น เครื่องเสมือนแบบแซนด์บ็อกซ์ คอนเทนเนอร์ หรือโปรไฟล์เบราว์เซอร์เฉพาะที่มีสิทธิ์จำกัด)
ตัวแฮนเดิลการดำเนินการฝั่งไคลเอ็นต์: คุณจะต้องใช้ตรรกะฝั่งไคลเอ็นต์ เพื่อดำเนินการที่โมเดลสร้างขึ้น และ บันทึกภาพหน้าจอของสภาพแวดล้อมหลังจากการดำเนินการแต่ละครั้ง

ตัวอย่างในส่วนนี้ใช้เบราว์เซอร์เป็นสภาพแวดล้อมการดำเนินการ และ Playwright เป็นตัวแฮนเดิลการดำเนินการฝั่งไคลเอ็นต์ หากต้องการ เรียกใช้ตัวอย่างเหล่านี้ คุณต้องติดตั้งการขึ้นต่อกันที่จำเป็นและเริ่มต้นอินสแตนซ์เบราว์เซอร์ Playwright

ติดตั้ง Playwright

    pip install google-genai playwright
    playwright install chromium

เริ่มต้นอินสแตนซ์เบราว์เซอร์ Playwright

    from playwright.sync_api import sync_playwright

    # 1. Configure screen dimensions for the target environment
    SCREEN_WIDTH = 1440
    SCREEN_HEIGHT = 900

    # 2. Start the Playwright browser
    # In production, utilize a sandboxed environment.
    playwright = sync_playwright().start()
    # Set headless=False to see the actions performed on your screen
    browser = playwright.chromium.launch(headless=False)

    # 3. Create a context and page with the specified dimensions
    context = browser.new_context(
        viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT}
    )
    page = context.new_page()

    # 4. Navigate to an initial page to start the task
    page.goto("https://www.google.com")

    # The 'page', 'SCREEN_WIDTH', and 'SCREEN_HEIGHT' variables
    # will be used in the steps below.

โค้ดตัวอย่างสำหรับการขยายไปยังสภาพแวดล้อม Android จะรวมอยู่ในส่วนการใช้ฟังก์ชันที่ผู้ใช้กำหนดเอง

1. ส่งคำขอไปยังโมเดล

เพิ่มเครื่องมือการใช้คอมพิวเตอร์ลงในคำขอ API และส่งพรอมต์ไปยังโมเดล ซึ่งมีเป้าหมายของผู้ใช้ คุณต้องใช้โมเดลที่รองรับการใช้งานคอมพิวเตอร์ หรือจะได้รับข้อผิดพลาด

gemini-2.5-computer-use-preview-10-2025
gemini-3-pro-preview
gemini-3-flash-preview

นอกจากนี้ คุณยังเพิ่มพารามิเตอร์ต่อไปนี้ได้ด้วย (ไม่บังคับ)

การดำเนินการที่ยกเว้น: หากมีการดำเนินการใดๆ จากรายการการดำเนินการ UI ที่รองรับที่คุณไม่ต้องการให้โมเดลดำเนินการ ให้ระบุการดำเนินการเหล่านี้เป็น excluded_predefined_functions
ฟังก์ชันที่ผู้ใช้กำหนด: นอกเหนือจากเครื่องมือการใช้งานคอมพิวเตอร์แล้ว คุณอาจต้องการรวมฟังก์ชันที่ผู้ใช้กำหนดเองด้วย

โปรดทราบว่าไม่จำเป็นต้องระบุขนาดการแสดงผลเมื่อส่งคำขอ เนื่องจากโมเดลจะคาดการณ์พิกัดพิกเซลที่ปรับขนาดตามความสูงและความกว้างของ หน้าจอ

Python

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

# Specify predefined functions to exclude (optional)
excluded_functions = ["drag_and_drop"]

generate_content_config = genai.types.GenerateContentConfig(
    tools=[
        # 1. Computer Use tool with browser environment
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
                # Optional: Exclude specific predefined functions
                excluded_predefined_functions=excluded_functions
                )
              ),
        # 2. Optional: Custom user-defined functions
        #types.Tool(
          # function_declarations=custom_functions
          #   )
          ],
  )

# Create the content with user message
contents=[
    Content(
        role="user",
        parts=[
            Part(text="Search for highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping. Create a bulleted list of the 3 cheapest options in the format of name, description, price in an easy-to-read layout."),
        ],
    )
]

# Generate content with the configured settings
response = client.models.generate_content(
    model='gemini-2.5-computer-use-preview-10-2025',
    contents=contents,
    config=generate_content_config,
)

# Print the response output
print(response)

ดูตัวอย่างฟังก์ชันที่กำหนดเองได้ที่การใช้ฟังก์ชันที่ผู้ใช้กำหนดเอง

2. รับคำตอบของโมเดล

เมื่อเปิดใช้เครื่องมือการใช้คอมพิวเตอร์ โมเดลจะตอบกลับด้วยFunctionCallsอย่างน้อย 1 รายการ หากพิจารณาแล้วว่าต้องมีการดำเนินการใน UI เพื่อทำงานให้เสร็จสมบูรณ์ การใช้คอมพิวเตอร์รองรับการเรียกใช้ฟังก์ชันแบบขนาน ซึ่งหมายความว่าโมเดลสามารถแสดงผลการดำเนินการหลายอย่างในเทิร์นเดียว

ตัวอย่างคำตอบของโมเดลมีดังนี้

{
  "content": {
    "parts": [
      {
        "text": "I will type the search query into the search bar. The search bar is in the center of the page."
      },
      {
        "function_call": {
          "name": "type_text_at",
          "args": {
            "x": 371,
            "y": 470,
            "text": "highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping",
            "press_enter": true
          }
        }
      }
    ]
  }
}

3. ดำเนินการตามการดำเนินการที่ได้รับ

โค้ดแอปพลิเคชันของคุณต้องแยกวิเคราะห์การตอบกลับของโมเดล ดำเนินการ และรวบรวมผลลัพธ์

โค้ดตัวอย่างด้านล่างจะดึงการเรียกใช้ฟังก์ชันจากคำตอบของโมเดลการใช้งานคอมพิวเตอร์ และแปลเป็นคำสั่งที่เรียกใช้ด้วย Playwright ได้ โมเดลจะแสดงผลพิกัดที่ปรับให้เป็นมาตรฐาน (0-999) โดยไม่คำนึงถึงขนาดของรูปภาพอินพุต ดังนั้นขั้นตอนการแปลบางส่วนจึงเป็นการแปลงพิกัดที่ปรับให้เป็นมาตรฐานเหล่านี้ กลับเป็นค่าพิกเซลจริง

ขนาดหน้าจอที่แนะนำสำหรับการใช้กับโมเดลการใช้คอมพิวเตอร์คือ (1440, 900) โมเดลจะทำงานกับความละเอียดใดก็ได้ แม้ว่าคุณภาพของผลลัพธ์อาจได้รับผลกระทบ

โปรดทราบว่าตัวอย่างนี้รวมเฉพาะการติดตั้งใช้งานสำหรับการดำเนินการ UI ที่พบบ่อยที่สุด 3 รายการ ได้แก่ open_web_browser, click_at และ type_text_at สำหรับ Use Case ในการใช้งานจริง คุณจะต้องใช้การดำเนินการอื่นๆ ทั้งหมดใน UI จากรายการการดำเนินการที่รองรับ เว้นแต่คุณจะเพิ่มการดำเนินการเหล่านั้นเป็นexcluded_predefined_functionsอย่างชัดแจ้ง

Python

from typing import Any, List, Tuple
import time

def denormalize_x(x: int, screen_width: int) -> int:
    """Convert normalized x coordinate (0-1000) to actual pixel coordinate."""
    return int(x / 1000 * screen_width)

def denormalize_y(y: int, screen_height: int) -> int:
    """Convert normalized y coordinate (0-1000) to actual pixel coordinate."""
    return int(y / 1000 * screen_height)

def execute_function_calls(candidate, page, screen_width, screen_height):
    results = []
    function_calls = []
    for part in candidate.content.parts:
        if part.function_call:
            function_calls.append(part.function_call)

    for function_call in function_calls:
        action_result = {}
        fname = function_call.name
        args = function_call.args
        print(f"  -> Executing: {fname}")

        try:
            if fname == "open_web_browser":
                pass # Already open
            elif fname == "click_at":
                actual_x = denormalize_x(args["x"], screen_width)
                actual_y = denormalize_y(args["y"], screen_height)
                page.mouse.click(actual_x, actual_y)
            elif fname == "type_text_at":
                actual_x = denormalize_x(args["x"], screen_width)
                actual_y = denormalize_y(args["y"], screen_height)
                text = args["text"]
                press_enter = args.get("press_enter", False)

                page.mouse.click(actual_x, actual_y)
                # Simple clear (Command+A, Backspace for Mac)
                page.keyboard.press("Meta+A")
                page.keyboard.press("Backspace")
                page.keyboard.type(text)
                if press_enter:
                    page.keyboard.press("Enter")
            else:
                print(f"Warning: Unimplemented or custom function {fname}")

            # Wait for potential navigations/renders
            page.wait_for_load_state(timeout=5000)
            time.sleep(1)

        except Exception as e:
            print(f"Error executing {fname}: {e}")
            action_result = {"error": str(e)}

        results.append((fname, action_result))

    return results

4. บันทึกสถานะสภาพแวดล้อมใหม่

หลังจากดำเนินการแล้ว ให้ส่งผลลัพธ์ของการเรียกใช้ฟังก์ชันกลับไปยังโมเดลเพื่อให้โมเดลใช้ข้อมูลนี้เพื่อสร้างการดำเนินการถัดไปได้ หากมีการดำเนินการหลายอย่าง (การเรียกแบบขนาน) คุณต้องส่ง FunctionResponse สำหรับแต่ละรายการในเทิร์นของผู้ใช้ถัดไป

Python


def get_function_responses(page, results):
    screenshot_bytes = page.screenshot(type="png")
    current_url = page.url
    function_responses = []
    for name, result in results:
        response_data = {"url": current_url}
        response_data.update(result)
        function_responses.append(
            types.FunctionResponse(
                name=name,
                response=response_data,
                parts=[types.FunctionResponsePart(
                        inline_data=types.FunctionResponseBlob(
                            mime_type="image/png",
                            data=screenshot_bytes))
                ]
            )
        )
    return function_responses

สร้างลูปของเอเจนต์

หากต้องการเปิดใช้การโต้ตอบแบบหลายขั้นตอน ให้รวม 4 ขั้นตอนจากส่วนวิธี ใช้คอมพิวเตอร์เป็นลูป อย่าลืมจัดการประวัติการสนทนาอย่างถูกต้องโดยการต่อท้ายทั้งคำตอบของโมเดลและคำตอบของฟังก์ชัน

หากต้องการเรียกใช้โค้ดตัวอย่างนี้ คุณต้องทำดังนี้

ติดตั้งการอ้างอิง Playwright ที่จำเป็น
กำหนดฟังก์ชันตัวช่วยจากขั้นตอน (3) ดำเนินการ การดำเนินการที่ได้รับและ (4) บันทึกสถานะ สภาพแวดล้อมใหม่

Python


import time
from typing import Any, List, Tuple
from playwright.sync_api import sync_playwright

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

# Constants for screen dimensions
SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900

# Setup Playwright
print("Initializing browser...")
playwright = sync_playwright().start()
browser = playwright.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT})
page = context.new_page()

# Define helper functions. Copy/paste from steps 3 and 4
# def denormalize_x(...)
# def denormalize_y(...)
# def execute_function_calls(...)
# def get_function_responses(...)

try:
    # Go to initial page
    page.goto("https://ai.google.dev/gemini-api/docs")

    # Configure the model (From Step 1)
    config = types.GenerateContentConfig(
        tools=[types.Tool(computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER
        ))],
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    )

    # Initialize history
    initial_screenshot = page.screenshot(type="png")
    USER_PROMPT = "Go to ai.google.dev/gemini-api/docs and search for pricing."
    print(f"Goal: {USER_PROMPT}")

    contents = [
        Content(role="user", parts=[
            Part(text=USER_PROMPT),
            Part.from_bytes(data=initial_screenshot, mime_type='image/png')
        ])
    ]

    # Agent Loop
    turn_limit = 5
    for i in range(turn_limit):
        print(f"\n--- Turn {i+1} ---")
        print("Thinking...")
        response = client.models.generate_content(
            model='gemini-2.5-computer-use-preview-10-2025',
            contents=contents,
            config=config,
        )

        candidate = response.candidates[0]
        contents.append(candidate.content)

        has_function_calls = any(part.function_call for part in candidate.content.parts)
        if not has_function_calls:
            text_response = " ".join([part.text for part in candidate.content.parts if part.text])
            print("Agent finished:", text_response)
            break

        print("Executing actions...")
        results = execute_function_calls(candidate, page, SCREEN_WIDTH, SCREEN_HEIGHT)

        print("Capturing state...")
        function_responses = get_function_responses(page, results)

        contents.append(
            Content(role="user", parts=[Part(function_response=fr) for fr in function_responses])
        )

finally:
    # Cleanup
    print("\nClosing browser...")
    browser.close()
    playwright.stop()

การใช้ฟังก์ชันที่ผู้ใช้กำหนดเอง

คุณอาจรวมฟังก์ชันที่ผู้ใช้กำหนดเองไว้ในคำขอเพื่อขยายฟังก์ชันการทำงานของโมเดลได้ (ไม่บังคับ) ตัวอย่างด้านล่างนี้ปรับรูปแบบและการใช้เครื่องมือการใช้คอมพิวเตอร์ สำหรับกรณีการใช้งานบนอุปกรณ์เคลื่อนที่โดยรวมการดำเนินการที่ผู้ใช้กำหนดเอง เช่น open_app, long_press_at และ go_home ขณะที่ยกเว้น การดำเนินการเฉพาะเบราว์เซอร์ โมเดลสามารถเรียกฟังก์ชันที่กำหนดเองเหล่านี้อย่างชาญฉลาดควบคู่ไปกับการดำเนินการ UI มาตรฐานเพื่อทำงานให้เสร็จสมบูรณ์ในสภาพแวดล้อมที่ไม่ใช่เบราว์เซอร์

Python

from typing import Optional, Dict, Any

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

SYSTEM_PROMPT = """You are operating an Android phone. Today's date is October 15, 2023, so ignore any other date provided.
* To provide an answer to the user, *do not use any tools* and output your answer on a separate line. IMPORTANT: Do not add any formatting or additional punctuation/text, just output the answer by itself after two empty lines.
* Make sure you scroll down to see everything before deciding something isn't available.
* You can open an app from anywhere. The icon doesn't have to currently be on screen.
* Unless explicitly told otherwise, make sure to save any changes you make.
* If text is cut off or incomplete, scroll or click into the element to get the full text before providing an answer.
* IMPORTANT: Complete the given task EXACTLY as stated. DO NOT make any assumptions that completing a similar task is correct.  If you can't find what you're looking for, SCROLL to find it.
* If you want to edit some text, ONLY USE THE `type` tool. Do not use the onscreen keyboard.
* Quick settings shouldn't be used to change settings. Use the Settings app instead.
* The given task may already be completed. If so, there is no need to do anything.
"""

def open_app(app_name: str, intent: Optional[str] = None) -> Dict[str, Any]:
    """Opens an app by name.

    Args:
        app_name: Name of the app to open (any string).
        intent: Optional deep-link or action to pass when launching, if the app supports it.

    Returns:
        JSON payload acknowledging the request (app name and optional intent).
    """
    return {"status": "requested_open", "app_name": app_name, "intent": intent}

def long_press_at(x: int, y: int) -> Dict[str, int]:
    """Long-press at a specific screen coordinate.

    Args:
        x: X coordinate (absolute), scaled to the device screen width (pixels).
        y: Y coordinate (absolute), scaled to the device screen height (pixels).

    Returns:
        Object with the coordinates pressed and the duration used.
    """
    return {"x": x, "y": y}

def go_home() -> Dict[str, str]:
    """Navigates to the device home screen.

    Returns:
        A small acknowledgment payload.
    """
    return {"status": "home_requested"}

#  Build function declarations
CUSTOM_FUNCTION_DECLARATIONS = [
    types.FunctionDeclaration.from_callable(client=client, callable=open_app),
    types.FunctionDeclaration.from_callable(client=client, callable=long_press_at),
    types.FunctionDeclaration.from_callable(client=client, callable=go_home),
]

#Exclude browser functions
EXCLUDED_PREDEFINED_FUNCTIONS = [
    "open_web_browser",
    "search",
    "navigate",
    "hover_at",
    "scroll_document",
    "go_forward",
    "key_combination",
    "drag_and_drop",
]

#Utility function to construct a GenerateContentConfig
def make_generate_content_config() -> genai.types.GenerateContentConfig:
    """Return a fixed GenerateContentConfig with Computer Use + custom functions."""
    return genai.types.GenerateContentConfig(
        system_instruction=SYSTEM_PROMPT,
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                    excluded_predefined_functions=EXCLUDED_PREDEFINED_FUNCTIONS,
                )
            ),
            types.Tool(function_declarations=CUSTOM_FUNCTION_DECLARATIONS),
        ],
    )

# Create the content with user message
contents: list[Content] = [
    Content(
        role="user",
        parts=[
            # text instruction
            Part(text="Open Chrome, then long-press at 200,400."),
        ],
    )
]

# Build your fixed config (from helper)
config = make_generate_content_config()

# Generate content with the configured settings
response = client.models.generate_content(
        model='gemini-2.5-computer-use-preview-10-2025',
        contents=contents,
        config=config,
    )

print(response)

การดำเนินการใน UI ที่รองรับ

โมเดลสามารถขอการดำเนินการใน UI ต่อไปนี้ผ่าน FunctionCall โค้ดฝั่งไคลเอ็นต์ต้องใช้ตรรกะการดำเนินการสําหรับการกระทําเหล่านี้ ดูการติดตั้งใช้งาน อ้างอิงเพื่อดู ตัวอย่าง

ชื่อคำสั่ง	คำอธิบาย	อาร์กิวเมนต์ (ใน Function Call)	ตัวอย่างการเรียกใช้ฟังก์ชัน
open_web_browser	เปิดเว็บเบราว์เซอร์	ไม่มี	`{"name": "open_web_browser", "args": {}}`
wait_5_seconds	หยุดการดำเนินการชั่วคราวเป็นเวลา 5 วินาทีเพื่อให้เนื้อหาแบบไดนามิกโหลดหรือภาพเคลื่อนไหวเสร็จสมบูรณ์	ไม่มี	`{"name": "wait_5_seconds", "args": {}}`
go_back	ไปยังหน้าก่อนหน้าในประวัติของเบราว์เซอร์	ไม่มี	`{"name": "go_back", "args": {}}`
go_forward	ไปยังหน้าถัดไปในประวัติของเบราว์เซอร์	ไม่มี	`{"name": "go_forward", "args": {}}`
search	ไปที่หน้าแรกของเครื่องมือค้นหาเริ่มต้น (เช่น Google) มีประโยชน์ในการเริ่มงานการค้นหาใหม่	ไม่มี	`{"name": "search", "args": {}}`
navigate	นำทางเบราว์เซอร์ไปยัง URL ที่ระบุโดยตรง	`url`: str	`{"name": "navigate", "args": {"url": "https://www.wikipedia.org"}}`
click_at	การคลิกที่พิกัดที่เฉพาะเจาะจงในหน้าเว็บ ค่า x และ y อิงตามตารางกริด 1000x1000 และปรับขนาดให้ตรงกับขนาดหน้าจอ	`y`: int (0-999), `x`: int (0-999)	`{"name": "click_at", "args": {"y": 300, "x": 500}}`
hover_at	วางเมาส์ที่พิกัดที่เฉพาะเจาะจงในหน้าเว็บ มีประโยชน์ในการแสดงเมนูย่อย โดย x และ y จะอิงตามตารางกริดขนาด 1000x1000	`y`: int (0-999) `x`: int (0-999)	`{"name": "hover_at", "args": {"y": 150, "x": 250}}`
type_text_at	พิมพ์ข้อความที่พิกัดที่เฉพาะเจาะจง โดยค่าเริ่มต้นคือล้างช่องก่อนแล้วกด ENTER หลังจากพิมพ์ แต่สามารถปิดใช้ได้ x และ y จะอิงตามตารางกริด 1000x1000	`y`: int (0-999), `x`: int (0-999), `text`: str, `press_enter`: bool (ไม่บังคับ ค่าเริ่มต้นคือ True), `clear_before_typing`: bool (ไม่บังคับ ค่าเริ่มต้นคือ True)	`{"name": "type_text_at", "args": {"y": 250, "x": 400, "text": "search query", "press_enter": false}}`
key_combination	กดแป้นพิมพ์หรือชุดแป้น เช่น "Control+C" หรือ "Enter" มีประโยชน์ในการเรียกใช้การดำเนินการ (เช่น การส่งแบบฟอร์มด้วย "Enter") หรือการดำเนินการในคลิปบอร์ด	`keys`: สตริง (เช่น "enter", "control+c")	`{"name": "key_combination", "args": {"keys": "Control+A"}}`
scroll_document	เลื่อนทั้งหน้าเว็บ "ขึ้น" "ลง" "ซ้าย" หรือ "ขวา"	`direction`: str ("up", "down", "left" หรือ "right")	`{"name": "scroll_document", "args": {"direction": "down"}}`
scroll_at	เลื่อนองค์ประกอบหรือพื้นที่ที่เฉพาะเจาะจงที่พิกัด (x, y) ในทิศทางที่ระบุตามขนาดที่กำหนด พิกัดและขนาด (ค่าเริ่มต้นคือ 800) อิงตามตารางกริดขนาด 1000x1000	`y`: int (0-999), `x`: int (0-999), `direction`: str ("up", "down", "left", "right"), `magnitude`: int (0-999, ไม่บังคับ, ค่าเริ่มต้น 800)	`{"name": "scroll_at", "args": {"y": 500, "x": 500, "direction": "down", "magnitude": 400}}`
drag_and_drop	ลากองค์ประกอบจากพิกัดเริ่มต้น (x, y) และวางที่พิกัดปลายทาง (destination_x, destination_y) พิกัดทั้งหมดอิงตามตารางกริดขนาด 1000x1000	`y`: int (0-999), `x`: int (0-999), `destination_y`: int (0-999), `destination_x`: int (0-999)	`{"name": "drag_and_drop", "args": {"y": 100, "x": 100, "destination_y": 500, "destination_x": 500}}`

ความปลอดภัย

รับทราบการตัดสินใจด้านความปลอดภัย

การตอบกลับของโมเดลอาจมีsafety_decisionจากระบบความปลอดภัยภายในที่ตรวจสอบ การดำเนินการที่โมเดลเสนอ ด้วย ทั้งนี้ขึ้นอยู่กับการดำเนินการ

{
  "content": {
    "parts": [
      {
        "text": "I have evaluated step 2. It seems Google detected unusual traffic and is asking me to verify I'm not a robot. I need to click the 'I'm not a robot' checkbox located near the top left (y=98, x=95).",
      },
      {
        "function_call": {
          "name": "click_at",
          "args": {
            "x": 60,
            "y": 100,
            "safety_decision": {
              "explanation": "I have encountered a CAPTCHA challenge that requires interaction. I need you to complete the challenge by clicking the 'I'm not a robot' checkbox and any subsequent verification steps.",
              "decision": "require_confirmation"
            }
          }
        }
      }
    ]
  }
}

หาก safety_decision เป็น require_confirmation คุณต้อง ขอให้ผู้ใช้ปลายทางยืนยันก่อนดำเนินการต่อเพื่อดำเนินการ ตามข้อกำหนดในการให้บริการ คุณไม่ได้รับอนุญาต ให้หลีกเลี่ยงคำขอการยืนยันจากมนุษย์

ตัวอย่างโค้ดนี้จะแจ้งให้ผู้ใช้ปลายทางยืนยันก่อนดำเนินการ การดำเนินการ หากผู้ใช้ไม่ยืนยันการดำเนินการ ลูปจะสิ้นสุด หากผู้ใช้ยืนยันการดำเนินการ ระบบจะดำเนินการดังกล่าวและทำเครื่องหมายช่อง safety_acknowledgement เป็น True

Python

import termcolor

def get_safety_confirmation(safety_decision):
    """Prompt user for confirmation when safety check is triggered."""
    termcolor.cprint("Safety service requires explicit confirmation!", color="red")
    print(safety_decision["explanation"])

    decision = ""
    while decision.lower() not in ("y", "n", "ye", "yes", "no"):
        decision = input("Do you wish to proceed? [Y]es/[N]o\n")

    if decision.lower() in ("n", "no"):
        return "TERMINATE"
    return "CONTINUE"

def execute_function_calls(candidate, page, screen_width, screen_height):

    # ... Extract function calls from response ...

    for function_call in function_calls:
        extra_fr_fields = {}

        # Check for safety decision
        if 'safety_decision' in function_call.args:
            decision = get_safety_confirmation(function_call.args['safety_decision'])
            if decision == "TERMINATE":
                print("Terminating agent loop")
                break
            extra_fr_fields["safety_acknowledgement"] = "true" # Safety acknowledgement

        # ... Execute function call and append to results ...

หากผู้ใช้ยืนยัน คุณต้องใส่การรับทราบเรื่องความปลอดภัยใน FunctionResponse

Python

function_response_parts.append(
    FunctionResponse(
        name=name,
        response={"url": current_url,
                  **extra_fr_fields},  # Include safety acknowledgement
        parts=[
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/png", data=screenshot
                )
             )
           ]
         )
       )

แนวทางปฏิบัติแนะนำด้านความปลอดภัย

การใช้คอมพิวเตอร์เป็นเครื่องมือใหม่ที่นำมาซึ่งความเสี่ยงใหม่ๆ ที่นักพัฒนาแอปควรตระหนักถึง

เนื้อหาที่ไม่น่าเชื่อถือและการหลอกลวง: ขณะที่โมเดลพยายามบรรลุเป้าหมายของผู้ใช้ โมเดลอาจอาศัยแหล่งข้อมูลและวิธีการที่ไม่น่าเชื่อถือ จากหน้าจอ ตัวอย่างเช่น หากเป้าหมายของผู้ใช้คือการซื้อโทรศัพท์ Pixel และโมเดลพบกลโกง "รับ Pixel ฟรีหากทำแบบสำรวจเสร็จ" ก็มีโอกาสที่โมเดลจะทำแบบสำรวจให้เสร็จ
การดำเนินการที่ไม่ตั้งใจเป็นครั้งคราว: โมเดลอาจตีความเป้าหมายของผู้ใช้ หรือเนื้อหาหน้าเว็บผิด ซึ่งทําให้ดำเนินการไม่ถูกต้อง เช่น คลิกปุ่ม ผิดหรือกรอกแบบฟอร์มผิด ซึ่งอาจทำให้งานล้มเหลวหรือ การกรองข้อมูลออก
การละเมิดนโยบาย: ความสามารถของ API อาจถูกนำไปใช้ ไม่ว่าจะโดย ตั้งใจหรือไม่ตั้งใจ เพื่อทำกิจกรรมที่ละเมิดนโยบายของ Google (นโยบายการใช้งานที่ไม่อนุญาตของ Gen AI และข้อกำหนดในการให้บริการเพิ่มเติมของ Gemini API ซึ่งรวมถึงการดำเนินการที่ อาจรบกวนความสมบูรณ์ของระบบ บั่นทอนความปลอดภัย หลีกเลี่ยงมาตรการรักษาความปลอดภัย ควบคุมอุปกรณ์ทางการแพทย์ ฯลฯ

หากต้องการจัดการความเสี่ยงเหล่านี้ คุณสามารถใช้มาตรการด้านความปลอดภัยและแนวทางปฏิบัติแนะนำต่อไปนี้ได้

การตรวจสอบโดยมนุษย์ (HITL):

ใช้การยืนยันผู้ใช้: เมื่อการตอบกลับด้านความปลอดภัยระบุว่า require_confirmation คุณต้องใช้การยืนยันผู้ใช้ก่อน การดำเนินการ ดูโค้ดตัวอย่างได้ที่รับทราบการตัดสินด้านความปลอดภัย

ระบุวิธีการด้านความปลอดภัยที่กำหนดเอง: นอกเหนือจากการตรวจสอบการยืนยันผู้ใช้ในตัวแล้ว นักพัฒนาแอปอาจเลือกเพิ่มคำสั่งของระบบที่กำหนดเอง ซึ่งบังคับใช้นโยบายด้านความปลอดภัยของตนเอง ไม่ว่าจะเพื่อบล็อกการดำเนินการบางอย่างของโมเดล หรือกำหนดให้ผู้ใช้ยืนยันก่อนที่โมเดลจะดำเนินการบางอย่าง ที่มีความเสี่ยงสูงและย้อนกลับไม่ได้ นี่คือตัวอย่างคำสั่งระบบความปลอดภัยที่กำหนดเอง ซึ่งคุณอาจรวมไว้เมื่อโต้ตอบกับโมเดล

ตัวอย่างคำแนะนำด้านความปลอดภัย

ตั้งกฎความปลอดภัยที่กำหนดเองเป็นคำสั่งของระบบโดยทำดังนี้

    ## **RULE 1: Seek User Confirmation (USER_CONFIRMATION)**

    This is your first and most important check. If the next required action falls
    into any of the following categories, you MUST stop immediately, and seek the
    user's explicit permission.

    **Procedure for Seeking Confirmation:**  * **For Consequential Actions:**
    Perform all preparatory steps (e.g., navigating, filling out forms, typing a
    message). You will ask for confirmation **AFTER** all necessary information is
    entered on the screen, but **BEFORE** you perform the final, irreversible action
    (e.g., before clicking "Send", "Submit", "Confirm Purchase", "Share").  * **For
    Prohibited Actions:** If the action is strictly forbidden (e.g., accepting legal
    terms, solving a CAPTCHA), you must first inform the user about the required
    action and ask for their confirmation to proceed.

    **USER_CONFIRMATION Categories:**

    *   **Consent and Agreements:** You are FORBIDDEN from accepting, selecting, or
        agreeing to any of the following on the user's behalf. You must ask the
        user to confirm before performing these actions.
        *   Terms of Service
        *   Privacy Policies
        *   Cookie consent banners
        *   End User License Agreements (EULAs)
        *   Any other legally significant contracts or agreements.
    *   **Robot Detection:** You MUST NEVER attempt to solve or bypass the
        following. You must ask the user to confirm before performing these actions.
    *   CAPTCHAs (of any kind)
        *   Any other anti-robot or human-verification mechanisms, even if you are
            capable.
    *   **Financial Transactions:**
        *   Completing any purchase.
        *   Managing or moving money (e.g., transfers, payments).
        *   Purchasing regulated goods or participating in gambling.
    *   **Sending Communications:**
        *   Sending emails.
        *   Sending messages on any platform (e.g., social media, chat apps).
        *   Posting content on social media or forums.
    *   **Accessing or Modifying Sensitive Information:**
        *   Health, financial, or government records (e.g., medical history, tax
            forms, passport status).
        *   Revealing or modifying sensitive personal identifiers (e.g., SSN, bank
            account number, credit card number).
    *   **User Data Management:**
        *   Accessing, downloading, or saving files from the web.
        *   Sharing or sending files/data to any third party.
        *   Transferring user data between systems.
    *   **Browser Data Usage:**
        *   Accessing or managing Chrome browsing history, bookmarks, autofill data,
            or saved passwords.
    *   **Security and Identity:**
        *   Logging into any user account.
        *   Any action that involves misrepresentation or impersonation (e.g.,
            creating a fan account, posting as someone else).
    *   **Insurmountable Obstacles:** If you are technically unable to interact with
        a user interface element or are stuck in a loop you cannot resolve, ask the
        user to take over.
    ---

    ## **RULE 2: Default Behavior (ACTUATE)**

    If an action does **NOT** fall under the conditions for `USER_CONFIRMATION`,
    your default behavior is to **Actuate**.

    **Actuation Means:**  You MUST proactively perform all necessary steps to move
    the user's request forward. Continue to actuate until you either complete the
    non-consequential task or encounter a condition defined in Rule 1.

    *   **Example 1:** If asked to send money, you will navigate to the payment
        portal, enter the recipient's details, and enter the amount. You will then
        **STOP** as per Rule 1 and ask for confirmation before clicking the final
        "Send" button.
    *   **Example 2:** If asked to post a message, you will navigate to the site,
        open the post composition window, and write the full message. You will then
        **STOP** as per Rule 1 and ask for confirmation before clicking the final
        "Post" button.

        After the user has confirmed, remember to get the user's latest screen
        before continuing to perform actions.

    # Final Response Guidelines:
    Write final response to the user in the following cases:
    - User confirmation
    - When the task is complete or you have enough information to respond to the user

สภาพแวดล้อมการดำเนินการที่ปลอดภัย: เรียกใช้เอเจนต์ในสภาพแวดล้อมแซนด์บ็อกซ์ที่ปลอดภัย เพื่อจำกัดผลกระทบที่อาจเกิดขึ้น (เช่น เครื่องเสมือน (VM) ที่อยู่ในแซนด์บ็อกซ์ คอนเทนเนอร์ (เช่น Docker) หรือโปรไฟล์เบราว์เซอร์เฉพาะที่มีสิทธิ์แบบจำกัด
การล้างข้อมูลอินพุต: ล้างข้อความทั้งหมดที่ผู้ใช้สร้างขึ้นในพรอมต์เพื่อลดความเสี่ยงของวิธีการที่ไม่ต้องการหรือการแทรกโค้ดพรอมต์ ซึ่งเป็น การรักษาความปลอดภัยที่มีประโยชน์ แต่ไม่สามารถใช้แทนสภาพแวดล้อมการดำเนินการที่ปลอดภัยได้
แนวทางการควบคุมเนื้อหา: ใช้แนวทางการควบคุมและAPI ความปลอดภัยของเนื้อหาเพื่อประเมินข้อมูลที่ผู้ใช้ป้อน ข้อมูลและการแสดงผลของเครื่องมือ คำตอบของเอเจนต์ว่าเหมาะสมหรือไม่ การแทรกพรอมต์ และการตรวจหาการหลบเลี่ยง
รายการที่อนุญาตและรายการที่บล็อก: ใช้กลไกการกรองเพื่อควบคุม ตำแหน่งที่โมเดลสามารถไปยังส่วนต่างๆ และสิ่งที่โมเดลทำได้ การใช้รายการที่บล็อกเว็บไซต์ที่ห้าม เป็นจุดเริ่มต้นที่ดี ในขณะที่การใช้รายการที่อนุญาตที่เข้มงวดมากขึ้นจะ ปลอดภัยยิ่งกว่า
การสังเกตการณ์และการบันทึก: จัดเก็บบันทึกโดยละเอียดสำหรับการแก้ไขข้อบกพร่อง การตรวจสอบ และการตอบสนองต่อเหตุการณ์ ไคลเอ็นต์ควรบันทึกพรอมต์ ภาพหน้าจอ การดำเนินการที่โมเดลแนะนำ (function_call) การตอบกลับด้านความปลอดภัย และ การดำเนินการทั้งหมดที่ไคลเอ็นต์ดำเนินการในท้ายที่สุด
การจัดการสภาพแวดล้อม: ตรวจสอบว่าสภาพแวดล้อม GUI สอดคล้องกัน ป๊อปอัป การแจ้งเตือน หรือการเปลี่ยนแปลงเลย์เอาต์ที่ไม่คาดคิดอาจทำให้โมเดลสับสน เริ่มต้นจากสถานะที่ทราบและสะอาดสำหรับงานใหม่แต่ละงานหากเป็นไปได้

เวอร์ชันของโมเดล

พร็อพเพอร์ตี้	คำอธิบาย
รหัสโมเดล	Gemini API `gemini-2.5-computer-use-preview-10-2025`
ประเภทข้อมูลที่รองรับ	อินพุต รูปภาพ ข้อความ เอาต์พุต ข้อความ
ขีดจำกัดของโทเค็น^[*]	ขีดจำกัดของโทเค็นอินพุต 128,000 ขีดจำกัดโทเค็นเอาต์พุต 64,000
เวอร์ชัน	อ่านรายละเอียดเพิ่มเติมได้ในรูปแบบเวอร์ชันของโมเดล ตัวอย่าง: `gemini-2.5-computer-use-preview-10-2025`
การอัปเดตล่าสุด	ตุลาคม 2025

ขั้นตอนถัดไป

ทดลองใช้คอมพิวเตอร์ในสภาพแวดล้อมเดโมของ Browserbase
ดูโค้ดตัวอย่างได้ที่การติดตั้งใช้งาน อ้างอิง
ดูข้อมูลเกี่ยวกับเครื่องมืออื่นๆ ของ Gemini API
- การเรียกใช้ฟังก์ชัน
- การเชื่อมต่อแหล่งข้อมูลกับ Google Search