Browser & VNC

AIO Sandbox provides a full browser environment with VNC (Virtual Network Computing) access, enabling visual interaction with web applications and GUI-based workflows.

Overview

AIO Sandbox offers multiple ways to interact with the browser:

CDP (Chrome DevTools Protocol): Low-level programmatic control
VNC Access: Full desktop environment with visual access
GUI Actions: Visual screenshots and interactions
Browser Automation: Integration with Playwright and Puppeteer

Connection

CDP (Chrome DevTools Protocol)

Chrome DevTools Protocol (CDP) is a low‑level, language‑agnostic protocol that allows external programs to instrument, inspect, and control Chrome or Chromium‑based browsers.
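To make the protocol concrete: CDP traffic is JSON sent over a WebSocket, and each command carries an id, a method, and optional params. The sketch below only builds such a message and opens no connection. The helper name build_cdp_command is illustrative and not part of any SDK.

```python
import json
from itertools import count
from typing import Optional

# Monotonic ids let a client match each response to the command that caused it.
_next_id = count(1)


def build_cdp_command(method: str, params: Optional[dict] = None) -> str:
    """Serialize one CDP command as the JSON text sent over the WebSocket."""
    message = {"id": next(_next_id), "method": method, "params": params or {}}
    return json.dumps(message)


# Page.navigate is a standard CDP method that takes a "url" parameter.
command = build_cdp_command("Page.navigate", {"url": "https://example.com"})
print(command)
```

Libraries such as Playwright and Puppeteer generate these messages internally, so hand-built commands are mainly useful for debugging.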

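The CDP address can be retrieved from either of two endpoints, /v1/browser/info or /json/version. The request below uses the first and reads the address from data.cdp_url: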

curl -X 'GET' \
  'http://127.0.0.1:8080/v1/browser/info' \
  -H 'accept: application/json' \
  | jq '.data.cdp_url'
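The same lookup can be sketched in Python with only the standard library. It assumes the response body is JSON with the address under data.cdp_url, as the jq filter above implies. The function names are illustrative.

```python
import json
from urllib.request import Request, urlopen


def extract_cdp_url(payload: dict) -> str:
    """Pull data.cdp_url out of a decoded /v1/browser/info response."""
    try:
        return payload["data"]["cdp_url"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response has no data.cdp_url field") from exc


def fetch_cdp_url(base_url: str = "http://127.0.0.1:8080", timeout: float = 5.0) -> str:
    """GET /v1/browser/info and return the CDP URL it reports."""
    request = Request(
        f"{base_url}/v1/browser/info",
        headers={"accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return extract_cdp_url(json.load(response))
```

Calling fetch_cdp_url() against a running sandbox should return the same value the curl and jq pipeline prints.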

Browser Automation

Chrome DevTools Protocol (CDP)

AIO Sandbox exposes CDP for programmatic browser control:

# Get CDP endpoint
curl http://localhost:8080/cdp/json/version
# Or get browser info (the CDP URL is in the response's data.cdp_url field)
curl http://localhost:8080/v1/browser/info

The version response includes webSocketDebuggerUrl, the WebSocket address that automation tools connect to.

Python SDK Integration

The Python SDK provides both synchronous and asynchronous clients for browser control:

Sync Client

from agent_sandbox import Sandbox
from agent_sandbox.browser import Action_Click, Action_MoveTo, Action_Typing

# Initialize client
client = Sandbox(base_url="http://localhost:8080")

# Get browser information
browser_info = client.browser.get_info()
print(f"CDP URL: {browser_info.cdp_url}")
print(f"Viewport: {browser_info.viewport}")

# Take screenshot
screenshot_data = client.browser.take_screenshot()
with open("screenshot.png", "wb") as f:
    for chunk in screenshot_data:
        f.write(chunk)

# Execute GUI actions
# Move mouse to position
client.browser.execute_action(
    request=Action_MoveTo(x=500, y=300)
)

# Click at current position
client.browser.execute_action(
    request=Action_Click()
)

# Type text
client.browser.execute_action(
    request=Action_Typing(text="Hello World")
)

Browser Use Integration

Example with the browser_use Python library:

import asyncio

from agent_sandbox import Sandbox
from browser_use.browser.browser import BrowserSession, BrowserProfile

# Get CDP URL
client = Sandbox(base_url="http://localhost:8080")
cdp_url = client.browser.get_info().cdp_url

# Configure browser profile
profile = {
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "ignore_https_errors": True,
    "viewport": {"width": 1920, "height": 1080},
}


async def main():
    # Create a session attached to the sandbox browser over CDP
    browser_session = BrowserSession(
        browser_profile=BrowserProfile(**profile),
        cdp_url=cdp_url
    )

    await browser_session.start()
    page = await browser_session.browser_context.new_page()
    await page.goto("https://example.com")


# await is only valid inside a coroutine, so run the steps through asyncio
asyncio.run(main())

Playwright Integration

Playwright can attach to the sandbox browser over CDP:

import asyncio

from playwright.async_api import async_playwright
from agent_sandbox import Sandbox

client = Sandbox(base_url="http://localhost:8080")


async def main():
    async with async_playwright() as p:
        browser_info = client.browser.get_info()
        cdp_url = browser_info.cdp_url

        # connect_over_cdp attaches to the running browser instead of launching one
        browser = await p.chromium.connect_over_cdp(cdp_url)
        page = await browser.new_page()
        await page.goto("https://example.com")
        await page.screenshot(path="screenshot.png")

        # Perform browser automation
        await page.fill('input[name="search"]', 'test query')
        await page.click('button[type="submit"]')
        await page.wait_for_load_state('networkidle')


# async with is only valid inside a coroutine, so run it through asyncio
asyncio.run(main())

MCP

Once connected to the /mcp endpoint, every tool with the browser_ prefix is a browser tool. Together these tools cover navigation, interaction, screenshot capture, and more.

For detailed implementation and usage, see @agent-infra/mcp-server-browser.
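As a small illustration of the prefix convention, the sketch below filters a tool listing down to the browser tools. It assumes a payload shaped like the result of an MCP tools/list call, which is an object with a tools array whose entries each have a name. The tool names in the sample are placeholders, not a confirmed list of what the sandbox exposes.

```python
def browser_tool_names(tools_list_result: dict) -> list:
    """Return the names of tools whose name starts with the browser_ prefix."""
    tools = tools_list_result.get("tools", [])
    return [tool["name"] for tool in tools if tool["name"].startswith("browser_")]


# Placeholder payload shaped like the result of an MCP tools/list call.
sample_result = {
    "tools": [
        {"name": "browser_navigate"},
        {"name": "browser_screenshot"},
        {"name": "example_other_tool"},
    ]
}
print(browser_tool_names(sample_result))
```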

GUI Actions

GUI actions provide screenshot-based interaction with the browser. Unlike browser automation, GUI operations rely only on screenshots and simulated mouse and keyboard input, which can be advantageous in strict risk-control scenarios where DOM manipulation is restricted.

Screenshot

Python

screenshot = client.browser.screenshot()
print(screenshot)

The response is an image in image/png format.
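A quick sanity check on the returned bytes can catch an error page being saved as an image. The sketch below assumes the screenshot arrives as an iterable of byte chunks, as the file-writing loop in the Python SDK example suggests. The helper name collect_png is illustrative.

```python
from typing import Iterable

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def collect_png(chunks: Iterable[bytes]) -> bytes:
    """Join streamed chunks and confirm the result starts with the PNG signature."""
    data = b"".join(chunks)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("screenshot data is not a PNG image")
    return data
```

The returned bytes can then be written to disk in one call, for example with open("screenshot.png", "wb").write(data).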

GUI Actions

Python

from agent_sandbox.browser import (
    Action_MoveTo, Action_Click, Action_Typing,
    Action_Scroll, Action_Hotkey, Action_DragTo
)

# Move mouse to coordinates
action_res = client.browser.execute_action(
    request=Action_MoveTo(x=100, y=100)
)

# Click with options
action_res = client.browser.execute_action(
    request=Action_Click(x=200, y=200, num_clicks=2)
)

# Type text with clipboard option
action_res = client.browser.execute_action(
    request=Action_Typing(text="Hello World", use_clipboard=True)
)

# Scroll the page
action_res = client.browser.execute_action(
    request=Action_Scroll(dx=0, dy=100)
)

# Execute hotkey combination
action_res = client.browser.execute_action(
    request=Action_Hotkey(keys=["ctrl", "c"])
)

Available Action Types

action_type	Description	Required	Optional
`MOVE_TO`	Move the mouse to the specified position	`x`, `y`	-
`CLICK`	Click operation	-	`x`, `y`, `button`, `num_clicks`
`MOUSE_DOWN`	Press the mouse button	-	`button`
`MOUSE_UP`	Release the mouse button	-	`button`
`RIGHT_CLICK`	Right-click	-	`x`, `y`
`DOUBLE_CLICK`	Double-click	-	`x`, `y`
`DRAG_TO`	Drag to the specified location	`x`, `y`	-
`SCROLL`	Scroll operation	-	`dx`, `dy`
`TYPING`	Input text	`text`	`use_clipboard`
`PRESS`	Press and release a key	`key`	-
`KEY_DOWN`	Press and hold a keyboard key	`key`	-
`KEY_UP`	Release keyboard key	`key`	-
`HOTKEY`	Key combination	`keys` (Array)	-

Example hotkey: ["ctrl", "c"] for copy, ["ctrl", "v"] for paste
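The required and optional columns of the table can be turned into a small pre-flight check. The sketch below is a client-side convenience only. The plain-dict payload with an action_type key is an assumption about the wire format, and the Action_* classes from the SDK remain the documented way to send actions.

```python
# Required and optional fields per action_type, copied from the table above.
ACTION_FIELDS = {
    "MOVE_TO": ({"x", "y"}, set()),
    "CLICK": (set(), {"x", "y", "button", "num_clicks"}),
    "MOUSE_DOWN": (set(), {"button"}),
    "MOUSE_UP": (set(), {"button"}),
    "RIGHT_CLICK": (set(), {"x", "y"}),
    "DOUBLE_CLICK": (set(), {"x", "y"}),
    "DRAG_TO": ({"x", "y"}, set()),
    "SCROLL": (set(), {"dx", "dy"}),
    "TYPING": ({"text"}, {"use_clipboard"}),
    "PRESS": ({"key"}, set()),
    "KEY_DOWN": ({"key"}, set()),
    "KEY_UP": ({"key"}, set()),
    "HOTKEY": ({"keys"}, set()),
}


def build_action(action_type: str, **fields) -> dict:
    """Check fields against the table and return a plain-dict action payload."""
    if action_type not in ACTION_FIELDS:
        raise ValueError(f"unknown action_type: {action_type}")
    required, optional = ACTION_FIELDS[action_type]
    given = set(fields)
    missing = required - given
    unexpected = given - required - optional
    if missing:
        raise ValueError(f"{action_type} is missing {sorted(missing)}")
    if unexpected:
        raise ValueError(f"{action_type} does not accept {sorted(unexpected)}")
    return {"action_type": action_type, **fields}


print(build_action("HOTKEY", keys=["ctrl", "c"]))
```

Validating before sending keeps malformed actions from reaching the sandbox, where a failure is harder to trace back to the caller.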

Take Over

To keep a human in the loop during browser use, there are two options:

1. VNC Access

Access the VNC interface at the following URL, or embed it directly into your application using an iframe:

http://localhost:8080/vnc/index.html?autoconnect=true

The VNC server provides:

Full desktop environment
Pre-installed Chrome browser
Mouse and keyboard interaction
Screen sharing capabilities

See EMBEDDING.md for more custom parameters.
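For server-rendered pages, the iframe markup can be generated rather than hand-written. The sketch below builds the VNC URL shown above and wraps it in an iframe tag. Only the autoconnect parameter comes from this page. The helper names and the default size values are placeholders.

```python
from html import escape
from urllib.parse import urlencode


def build_vnc_url(base_url: str = "http://localhost:8080", autoconnect: bool = True) -> str:
    """Build the VNC page URL with the autoconnect query parameter."""
    query = urlencode({"autoconnect": str(autoconnect).lower()})
    return f"{base_url.rstrip('/')}/vnc/index.html?{query}"


def vnc_iframe(url: str, width: str = "100%", height: str = "800px") -> str:
    """Return an iframe tag that embeds the VNC page (size values are placeholders)."""
    src = escape(url, quote=True)  # keep the URL safe inside the HTML attribute
    return f'<iframe src="{src}" style="width:{width};height:{height};border:0"></iframe>'


print(vnc_iframe(build_vnc_url()))
```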

2. CDP Access

You can use the @agent-infra/browser-ui React component library to connect to a CDP address for takeover. Below is a code example:

import React, { useRef } from 'react';
import { BrowserCanvas, BrowserCanvasRef, Browser, Page } from '@agent-infra/browser-ui';

function App() {
  const canvasRef = useRef<BrowserCanvasRef>(null);

  const handleReady = ({ browser, page }: { browser: Browser; page: Page }) => {
    console.log('Browser connected, current URL:', page.url());

    // Listen for navigation events
    page.on('framenavigated', (frame) => {
      if (frame === page.mainFrame()) {
        console.log('Navigated to:', frame.url());
      }
    });
  };

  const handleError = (error: Error) => {
    console.error('Browser connection error:', error);
  };

  return (
    <div style={{ width: '100%', height: '800px', position: 'relative' }}>
      <BrowserCanvas
        ref={canvasRef}
        cdpEndpoint="http://localhost:8080/json/version"
        onReady={handleReady}
        onError={handleError}
        onSessionEnd={() => console.log('Session ended')}
      />
    </div>
  );
}

VNC vs Canvas Comparison

Dimension	VNC	Canvas + CDP
Technology	Remote desktop protocol, transmits entire screen pixels	Controls browser via CDP, renders content on Canvas
Protocol	RFB (Remote Framebuffer)	WebSocket + CDP
Content	Complete browser UI with tabs	Current page content only (tabs can be implemented separately)
Bandwidth	High (10-50 Mbps)	Low (1-5 Mbps)
Latency	Higher (50-200ms)	Lower (10-50ms)
Stability	Less prone to disconnection	May disconnect, requires heartbeat with CDP
CPU Usage	High (desktop encoding)	Low (browser rendering only)
Memory Usage	High (full desktop environment)	Low (browser process only)
Control Scope	Entire browser	Browser internal pages only
Automation	Basic (mouse/keyboard simulation)	Powerful (DOM manipulation, network interception, JS injection)
Multi-window	✅ Supported	❌ Single browser window only
File Operations	✅ Can access local files	❌ Limited by browser sandbox
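To put the bandwidth row in perspective, the sketch below converts the quoted bit rates into data transferred over a session. The 10-minute session length is a hypothetical figure chosen for illustration. The rates are the ranges from the table.

```python
def transfer_megabytes(mbps: float, seconds: float) -> float:
    """Convert a sustained bit rate (megabits/s) into megabytes transferred."""
    return mbps * seconds / 8  # 8 bits per byte


session_seconds = 10 * 60  # a hypothetical 10-minute takeover session
for label, low, high in (("VNC", 10, 50), ("Canvas + CDP", 1, 5)):
    print(
        label,
        transfer_megabytes(low, session_seconds),
        "to",
        transfer_megabytes(high, session_seconds),
        "MB",
    )
```

At these rates a 10-minute VNC session moves roughly 750 to 3750 MB, against 75 to 375 MB for Canvas + CDP.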

Q&A

CDP vs MCP Tools - What's the Difference?

Abstraction Level: MCP provides high-level, ready-to-use abstractions, while CDP offers low-level, flexible control
Connection Stability: MCP connections are more stable because the container's MCP Server wraps the CDP protocol and exposes it through HTTP interfaces
Flexibility: CDP is more flexible - once connected, you get browser and page instances for fine-grained control
