October 29, 2025
As LLMs continue to evolve, AI applications have gone through three generational shifts:
Agents can autonomously perceive their environment, plan steps, and invoke tools, enabling them to operate computers like humans: automatically browsing web pages to collect information, generating and running code to analyze data, executing system commands to manage files, and even completing complex multi-step operations through visual interfaces. This capability allows Agent deliverables to approach or even exceed human professional standards.


A well-configured computer can significantly improve a person's work efficiency; similarly, a powerful sandbox environment can improve an Agent's task quality and execution speed.
In one sentence: AIO Sandbox integrates the browser, code execution, terminal, visual takeover, forward and reverse proxies, MCP, authentication, and other basic capabilities into a single sandbox. The environment can be customized as needed, so that different Agents can "complete tasks more efficiently in a unified environment container".

Website: sandbox.agent-infra.com
Github: github.com/agent-infra/sandbox

Sandbox capabilities are exposed through the /mcp protocol, and an API / SDK is also provided for customizing sandbox toolsets.
Services inside the sandbox can be reached via {port}-{domain} wildcard domains or /proxy|/absproxy/{port} paths (convenient for preview/demo).
| Instruction | Replay | Screenshot |
|---|---|---|
| Help me design an interesting website to introduce sauropod dinosaurs from the Jurassic and Cretaceous periods for elementary school children. I want the website to be cartoon-styled. | Replay | ![]() |
| Search for news about ByteDance's Seed 1.6 model, then write a modern-styled webpage and deploy it | Replay | ![]() |
| Based on this OSWorld image, please search for the latest information on the internet and design a modern website for it. | Replay | ![]() |
| Play Poki 2048 game | Replay | ![]() |
One-Click Deploy All-in-One Sandbox Application--Function Service-Volcano Engine

Prerequisites: Install Docker, then start locally with one command:

AIO Sandbox provides Agents with basic capabilities such as Browser, File, Shell, and Code. It is also extensible, so developers can combine and customize dedicated sandboxes for their Agents' needs (such as AIO Sandbox for Mobile/Medical/Legal/Finance/Scientific Research).
Sandbox customization comes in progressively deeper levels:
1. Use the built-in tools directly through the /mcp endpoint. This suits quick PoC Agent validation.
2. Compose higher-level tools through the API / SDK (e.g., web_search). Skills can also be extended to automate the handling of specific sandbox tasks.
3. Build an image FROM the aio.sandbox base image, install specific dependencies (such as multimedia/image processing), and mount custom services (e.g., /custom_tools/ocr for image recognition).

At its core, the Agent browser environment provides CDP and VNC, so mainstream Browser Use frameworks can be used directly. AIO also provides an X11-based GUI interface for operating the browser visually. Combined with CDP, this gives more efficient Browser Use that is less likely to trigger risk control.

CDP (Chrome DevTools Protocol) is a protocol for communicating with Chrome or Chromium browsers. It exposes browser control APIs over WebSocket for navigation and loading, DOM manipulation, JS execution/debugging, network interception and simulation, screenshots and rendering, security and permissions, and more. To make this concrete, here is an example of using CDP to issue a page navigation command:
Visit http://localhost:9222/json/version. The webSocketDebuggerUrl field in the response is the CDP address:
After establishing a WebSocket connection with CDP, you can execute browser commands:
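As a hedged sketch against a plain local Chrome, the Python below reads webSocketDebuggerUrl from /json/version and sends one command over the WebSocket. It assumes Chrome was started with --remote-debugging-port=9222 and that the third-party websocket-client package is installed. The browser-level endpoint does not accept page-domain commands such as Page.navigate directly, so the sketch uses Target.createTarget instead, which opens a new tab at the given URL. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def get_cdp_ws_url(base: str = "http://localhost:9222") -> str:
    """Read the browser-level CDP WebSocket URL from /json/version."""
    with urllib.request.urlopen(f"{base}/json/version", timeout=5) as resp:
        return json.load(resp)["webSocketDebuggerUrl"]


def cdp_command(msg_id: int, method: str, params: Optional[dict] = None) -> str:
    """Serialize one CDP command: an id to match the reply, a method, and params."""
    return json.dumps({"id": msg_id, "method": method, "params": params or {}})


def open_url(ws_url: str, url: str) -> dict:
    """Open `url` in a new tab via Target.createTarget and return the CDP reply."""
    # Third-party dependency (assumption): pip install websocket-client
    import websocket

    ws = websocket.create_connection(ws_url, timeout=10)
    try:
        ws.send(cdp_command(1, "Target.createTarget", {"url": url}))
        while True:
            # Events can be interleaved with replies; wait for the reply with our id.
            message = json.loads(ws.recv())
            if message.get("id") == 1:
                return message
    finally:
        ws.close()


# Usage against a locally started Chrome (not executed here):
#   reply = open_url(get_cdp_ws_url(), "https://example.com")
#   reply["result"]["targetId"] identifies the newly opened tab.
```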

Note: AIO Sandbox doesn't directly expose the CDP /json/version interface. Instead, it relays CDP through a uvicorn service and adds heartbeat detection to avoid WebSocket disconnections.
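The relay's heartbeat mechanism is not described in detail. The sketch below shows one common approach, under that assumption: a cheap command (here Browser.getVersion) is sent on a fixed interval so that idle intermediaries do not drop the connection. The send callable and the high id range are hypothetical. WebSocket-level ping frames are an equally valid alternative.

```python
import asyncio
import json
from typing import Awaitable, Callable


async def heartbeat(
    send: Callable[[str], Awaitable[None]],
    stop: asyncio.Event,
    interval: float = 15.0,
    first_id: int = 1_000_000,
) -> int:
    """Send a cheap CDP command every `interval` seconds until `stop` is set.

    Returns the number of heartbeats sent. A high id range keeps heartbeat
    replies easy to tell apart from replies to real commands.
    """
    sent = 0
    while not stop.is_set():
        await send(json.dumps({"id": first_id + sent, "method": "Browser.getVersion"}))
        sent += 1
        try:
            # Sleep for `interval`, but wake up immediately if asked to stop.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    return sent
```

In a real relay, `send` would write to the upstream CDP WebSocket. The task would be started alongside the message-forwarding loop and stopped when either side disconnects.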
Screenshots
Unlike CDP-based screenshots, visual screenshots (/v1/browser/screenshot) include the Tabs and capture the entire browser window. Visual operations likewise target the entire browser window.
| GUI Browser Screenshot (Tabs) | CDP-based Page Screenshot (Page) |
|---|---|
| ![]() | ![]() |
Unlike CDP browser operations, visual operations (/v1/browser/actions) simulate human behavior such as clicking, typing, and scrolling. This makes them less likely to trigger the target website's risk-control strategies.
Unified Action Space: GUI operations are abstracted into composable, minimal atomic actions, such as moving the mouse, clicking, dragging, scrolling, pressing keys, and typing text. Utility actions such as wait are added on top. The action set is aligned as closely as possible with the actions that VLM visual models actually execute.
| action_type | Description | Required Parameters | Optional Parameters |
|---|---|---|---|
| MOVE_TO | Move mouse to specified position | x, y | - |
| MOVE_REL | Move relative to current mouse position | x_offset, y_offset | - |
| CLICK | Click operation | - | x, y, button, num_clicks |
| MOUSE_DOWN | Press mouse button | - | button |
| MOUSE_UP | Release mouse button | - | button |
| RIGHT_CLICK | Right click | - | x, y |
| DOUBLE_CLICK | Double click | - | x, y |
| DRAG_TO | Drag to specified position | x, y | - |
| DRAG_REL | Drag relative to current mouse position | x_offset, y_offset | - |
| SCROLL | Scroll operation | - | dx, dy |
| TYPING | Type text | text | - |
| PRESS | Press key | key | - |
| KEY_DOWN | Press keyboard key | key | - |
| KEY_UP | Release keyboard key | key | - |
| HOTKEY | Key combination | keys (array), e.g. ["ctrl", "c"] | - |
| WAIT | Wait | duration (seconds) | - |
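The table maps naturally onto a small validator. The sketch below checks an action against the required and optional parameters listed above before posting it. Several details are assumptions, not the documented /v1/browser/actions contract: the JSON body shape (`{"action_type": ..., plus parameters}`), the POST method, and the base URL.

```python
import json
import urllib.request
from typing import Any

# action_type -> (required params, optional params), transcribed from the table.
ACTION_SPEC: dict[str, tuple[tuple[str, ...], tuple[str, ...]]] = {
    "MOVE_TO": (("x", "y"), ()),
    "MOVE_REL": (("x_offset", "y_offset"), ()),
    "CLICK": ((), ("x", "y", "button", "num_clicks")),
    "MOUSE_DOWN": ((), ("button",)),
    "MOUSE_UP": ((), ("button",)),
    "RIGHT_CLICK": ((), ("x", "y")),
    "DOUBLE_CLICK": ((), ("x", "y")),
    "DRAG_TO": (("x", "y"), ()),
    "DRAG_REL": (("x_offset", "y_offset"), ()),
    "SCROLL": ((), ("dx", "dy")),
    "TYPING": (("text",), ()),
    "PRESS": (("key",), ()),
    "KEY_DOWN": (("key",), ()),
    "KEY_UP": (("key",), ()),
    "HOTKEY": (("keys",), ()),
    "WAIT": (("duration",), ()),
}


def build_action(action_type: str, **params: Any) -> dict[str, Any]:
    """Validate parameters against ACTION_SPEC and return the action payload."""
    if action_type not in ACTION_SPEC:
        raise ValueError(f"unknown action_type: {action_type}")
    required, optional = ACTION_SPEC[action_type]
    missing = [p for p in required if p not in params]
    unknown = [p for p in params if p not in required + optional]
    if missing or unknown:
        raise ValueError(f"{action_type}: missing={missing} unknown={unknown}")
    return {"action_type": action_type, **params}


def send_action(base_url: str, action: dict[str, Any]) -> bytes:
    """POST one action as JSON (assumed request format) and return the raw response."""
    req = urllib.request.Request(
        f"{base_url}/v1/browser/actions",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()
```

Validating on the client side means that a VLM emitting a malformed action (for example, MOVE_TO without y) fails fast with a clear message, instead of producing a round trip to the sandbox.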
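When Browser Use hits a login requirement, human takeover is generally needed, which requires an interactive browser interface. There are currently two approaches:
1. VNC: the /vnc/index.html page lets the user interact with the browser directly.
2. Canvas + CDP: the current page is rendered onto a Canvas, and the browser is controlled via CDP.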
The differences between the two takeover methods are roughly as follows:
| Comparison Dimension | VNC | Canvas + CDP (Chrome DevTools Protocol) |
|---|---|---|
| Technical Principle | Remote desktop protocol, transmits entire screen pixels | Controls browser via CDP, Canvas renders content |
| Transport Protocol | RFB (Remote Framebuffer) | WebSocket + CDP |
| Transport Content | Complete browser view (with Tabs) | Only browser current page content (no Tabs by default, can be implemented separately) |
| Bandwidth Usage | High (10-50 Mbps) | Low (1-5 Mbps) |
| Latency | Higher (50-200ms) | Lower (10-50ms) |
| Stability | Not easily disconnected | Easily disconnected, needs manual heartbeat with CDP to avoid disconnection |
| CPU Usage | High (desktop encoding) | Low (browser rendering only) |
| Memory Usage | High (needs complete desktop environment) | Low (browser process only) |
| Control Range | Entire browser | Browser internal pages only |
| Automation Capability | Basic (mouse keyboard simulation) | Powerful (DOM operations, network interception, JS injection, etc.) |
| Multi-window Support | ✅ Supported | ❌ Single browser window only |
| File Operations | ✅ Can operate local files | ❌ Limited by browser sandbox |
For Coding Agents, most tasks can be completed by executing commands on the command line. The Shell module uses OpenHands' CmdRunAction as its execution engine, combined with tmux, to support multi-session execution.
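The sketch below does not reproduce the CmdRunAction engine. It only illustrates why tmux makes multi-session execution straightforward: each Agent shell session maps to one detached, named tmux session, and the sketch builds the real tmux commands to create such a session, type a command into it, and read its pane. The helper names are hypothetical, and `run()` needs tmux on PATH.

```python
import subprocess

# One Agent shell session == one detached tmux session, addressed by name.


def new_session(name: str) -> list[str]:
    # -d: create the session detached, so no terminal is attached to the caller.
    return ["tmux", "new-session", "-d", "-s", name]


def send_command(name: str, command: str) -> list[str]:
    # Type the command into the session, then press Enter to run it.
    return ["tmux", "send-keys", "-t", name, command, "Enter"]


def capture_output(name: str) -> list[str]:
    # -p: print the visible pane contents to stdout.
    return ["tmux", "capture-pane", "-p", "-t", name]


def run(argv: list[str]) -> str:
    """Execute a tmux command and return its stdout (requires tmux on PATH)."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


# Usage sketch (needs tmux; not executed here):
#   for s in ("server", "build"):
#       run(new_session(s))
#   run(send_command("server", "python3 -m http.server 8000"))  # keeps running
#   run(send_command("build", "ls -la"))                         # independent session
#   print(run(capture_output("build")))  # output may lag; poll in practice
```

Because a long-running process (such as a dev server) occupies only its own session, the Agent can keep issuing commands in other sessions without blocking.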

File/code editing only requires two tools:
File CRUD: Encapsulates basic I/O for file read/write/list directory/create/upload/download, with path validation and permission control, covering common file operation scenarios.
Text Editor: implements str_replace_editor, a fine-grained, model-oriented editing tool that supports:
- view: view a file or directory, optionally within a line range
- str_replace: exact string replacement
- insert: insert text at a given line (legacy version support)
- undo_edit: undo
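Below is a minimal in-memory sketch of the four commands. It assumes the widely used semantics for this kind of tool: str_replace succeeds only when old_str occurs exactly once; insert places text after a 1-based line number, with 0 meaning the top of the file; and every edit pushes the previous version onto an undo stack. A dict stands in for the file system. This is not AIO's implementation.

```python
from typing import Optional


class StrReplaceEditor:
    """In-memory sketch of view / str_replace / insert / undo_edit."""

    def __init__(self, files: dict[str, str]) -> None:
        self.files = files  # path -> contents (stand-in for a real file system)
        self.history: dict[str, list[str]] = {}  # path -> earlier versions

    def view(self, path: str, view_range: Optional[tuple[int, int]] = None) -> str:
        lines = self.files[path].splitlines()
        start, end = view_range or (1, len(lines))
        # 1-based, inclusive range; each line is prefixed with its number.
        return "\n".join(f"{i}\t{lines[i - 1]}" for i in range(start, end + 1))

    def _save(self, path: str, new_text: str) -> None:
        self.history.setdefault(path, []).append(self.files[path])
        self.files[path] = new_text

    def str_replace(self, path: str, old_str: str, new_str: str) -> None:
        count = self.files[path].count(old_str)
        # Exact replacement is only safe when old_str identifies one location.
        if count != 1:
            raise ValueError(f"old_str must occur exactly once, found {count}")
        self._save(path, self.files[path].replace(old_str, new_str))

    def insert(self, path: str, insert_line: int, new_str: str) -> None:
        lines = self.files[path].splitlines(keepends=True)
        if not 0 <= insert_line <= len(lines):
            raise ValueError(f"insert_line out of range: {insert_line}")
        if insert_line == len(lines) and lines and not lines[-1].endswith("\n"):
            lines[-1] += "\n"  # avoid gluing new text onto an unterminated last line
        new_lines = new_str.splitlines(keepends=True)
        if new_lines and not new_lines[-1].endswith("\n"):
            new_lines[-1] += "\n"
        self._save(path, "".join(lines[:insert_line] + new_lines + lines[insert_line:]))

    def undo_edit(self, path: str) -> None:
        if not self.history.get(path):
            raise ValueError("no edit to undo")
        self.files[path] = self.history[path].pop()
```

Requiring a unique match is what makes the tool "model-oriented": an ambiguous replacement is rejected and the model must add more context, rather than the tool silently editing the wrong place.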
To balance language coverage against image size, code execution uses the Python 3.10/3.11/3.12 and Node.js 22 runtimes from Sandbox Fusion, providing an integrated, securely isolated environment for running code.
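The sketch below is for illustration only. It picks an interpreter by (language, version) and runs a snippet in a separate process with a timeout. The RUNTIMES table and the interpreter names are hypothetical. A subprocess with a timeout only bounds wall-clock time; it does not provide the security isolation that the sandbox itself provides.

```python
import subprocess
import sys

# Hypothetical (language, version) -> command prefix; the code is passed as the
# final argument ("-c" for Python, "-e" for Node.js).
RUNTIMES: dict[tuple[str, str], list[str]] = {
    ("python", "3.10"): ["python3.10", "-c"],
    ("python", "3.11"): ["python3.11", "-c"],
    ("python", "3.12"): ["python3.12", "-c"],
    ("nodejs", "22"): ["node", "-e"],
    # Fallback: whichever interpreter is running this script.
    ("python", "current"): [sys.executable, "-c"],
}


def run_code(code: str, language: str = "python", version: str = "3.12",
             timeout: float = 10.0) -> dict:
    """Run a snippet in a separate process and report status, output, and exit code."""
    cmd = RUNTIMES[(language, version)] + [code]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing keeps running.
        return {"status": "timeout", "stdout": "", "stderr": "", "exit_code": None}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr,
            "exit_code": proc.returncode}
```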

The unified entry point /mcp aggregates multiple MCP Servers (e.g., chrome-devtools-mcp), supports parameter-level filtering, and allows tool names to be prefixed (namespacing).

MCP Servers can currently be filtered by search. Multi-dimensional filtering by tags and category is planned, to reduce redundant calls and lower model token costs.
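The sketch below shows one way that aggregation, namespacing, and server-level search filtering could fit together. Tool lists from several servers are merged into one routing table, and each tool is exposed as `<server>__<tool>` so that same-named tools from different servers cannot collide. Calls are then routed back by name. The "__" separator and the dict-based tool format are assumptions.

```python
from typing import Optional

SEPARATOR = "__"  # hypothetical separator between server namespace and tool name


def aggregate_tools(
    servers: dict[str, list[dict]],
    namespace: bool = True,
    search: Optional[str] = None,
) -> dict[str, tuple[str, dict]]:
    """Merge tool lists from several MCP servers into {exposed_name: (server, tool)}."""
    table: dict[str, tuple[str, dict]] = {}
    needle = search.lower() if search else None
    for server, tools in servers.items():
        # Server-level search filter: skip servers whose name does not match.
        if needle and needle not in server.lower():
            continue
        for tool in tools:
            exposed = f"{server}{SEPARATOR}{tool['name']}" if namespace else tool["name"]
            if exposed in table:
                raise ValueError(f"tool name collision: {exposed}")
            table[exposed] = (server, tool)
    return table


def route(table: dict[str, tuple[str, dict]], exposed_name: str) -> tuple[str, str]:
    """Resolve an exposed tool name back to (server, original tool name)."""
    server, tool = table[exposed_name]
    return server, tool["name"]
```

Filtering before the tool list reaches the model is what saves tokens: tools from servers the Agent does not need are never described in the prompt.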

In Agent sandboxes, there are generally two types of scenarios corresponding to forward and reverse proxies:
Forward Proxy: Browser Use Agent can access private/global networks
Reverse Proxy: Coding Agent services developed inside the sandbox are exposed externally for user-side preview
TinyProxy is used as the proxy server to bypass geographic restrictions, access restricted content, or provide secure access within corporate intranets.

Why introduce TinyProxy when Chrome already supports --proxy-server for specifying a proxy?
The official Chromium documentation states that Chrome will not use a username/password embedded in the proxy settings (e.g., http://user:pass@host:port). Authentication must instead go through a separate challenge dialog, which disrupts the entire Browser Use flow (as shown below):
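The sketch below illustrates the division of labor under stated assumptions. Chrome only receives a credential-free --proxy-server flag pointing at a local relay (8888 is TinyProxy's default port). Meanwhile, the upstream host and an HTTP Basic Proxy-Authorization value are derived for the relay to use. How TinyProxy is actually configured with upstream credentials is not shown, and `split_proxy` is a hypothetical helper.

```python
import base64
from typing import Optional, Tuple
from urllib.parse import unquote, urlsplit


def split_proxy(upstream: str, local_port: int = 8888) -> Tuple[str, str, Optional[str]]:
    """Split an authenticated upstream proxy URL into what Chrome and the relay each need.

    Returns (chrome_flag, upstream_host_port, proxy_authorization_value).
    """
    parts = urlsplit(upstream)
    # Chrome only ever sees a credential-free local proxy address.
    chrome_flag = f"--proxy-server=http://127.0.0.1:{local_port}"
    host_port = f"{parts.hostname}:{parts.port}"
    header = None
    if parts.username is not None:
        creds = f"{unquote(parts.username)}:{unquote(parts.password or '')}"
        # Standard HTTP Basic scheme, which the relay sends to the upstream proxy.
        header = "Basic " + base64.b64encode(creds.encode("utf-8")).decode("ascii")
    return chrome_flag, host_port, header
```

Because the credentials stay with the relay, the browser never receives a 407 challenge, and no login dialog interrupts the Agent.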

Two methods are provided for accessing service ports inside the sandbox:
Subdomain wildcard forwarding (recommended): any domain matching the ${port}-${domain} format is forwarded to the corresponding port inside the sandbox.

Subpath forwarding: this method runs into many issues. For routing-sensitive services (such as frontend projects), the extra /proxy|absproxy/${port} path prefix causes resource requests to fail with 404.
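The sketch below implements both routing rules. It assumes that /proxy strips its prefix before forwarding, while /absproxy forwards the full path unchanged; this is the convention code-server uses for the same path names. The regular expressions and function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Host pattern "${port}-${domain}", e.g. "3000-sandbox.example.com".
HOST_RE = re.compile(r"^(\d{1,5})-(.+)$")
# Path pattern "/proxy/${port}/..." or "/absproxy/${port}/...".
PATH_RE = re.compile(r"^/(proxy|absproxy)/(\d{1,5})(/.*)?$")


def port_from_host(host: str) -> Optional[int]:
    """Subdomain wildcard forwarding: the target port is encoded in the Host header."""
    match = HOST_RE.match(host.split(":", 1)[0])  # drop any ":443"-style suffix
    return int(match.group(1)) if match else None


def route_subpath(path: str) -> Optional[Tuple[int, str]]:
    """Subpath forwarding: return (port, path to forward), or None if no prefix matches."""
    match = PATH_RE.match(path)
    if match is None:
        return None
    kind, port, rest = match.group(1), int(match.group(2)), match.group(3) or "/"
    # Assumption: /proxy strips its prefix, /absproxy forwards the path unchanged.
    forwarded = rest if kind == "proxy" else path
    return port, forwarded
```

For example, `route_subpath("/assets/app.js")` returns None. Suppose a frontend is served under /proxy/3000/ and its HTML references the absolute path /assets/app.js. The browser then requests that path without the prefix, so it matches no route and ends in a 404. Subdomain forwarding avoids this because the app is served from / on its own host.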
Agent operations in the sandbox generate user data. AIO Sandbox therefore needs unified authentication, added non-intrusively: without modifying any existing business routing configuration, and without adding mental overhead when routes are extended in the future. To achieve this, an "asymmetric encryption + JWT" reverse-proxy architecture was designed at the internal Nginx gateway layer:

The gateway verifies tokens with the public key supplied in JWT_PUBLIC_KEY.
The business service uses the private key to generate a JWT valid for 1 hour. Below is a simplified script for generating a JWT; in practice, the business backend should use a mature JWT library:
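The following is a minimal Python sketch, assuming RS256 and a hypothetical `sub` claim. The header and payload are base64url-encoded without padding, and `exp` is set one hour after `iat`. The resulting signing input is then signed with the private key. Signing uses the third-party cryptography package, which is imported only when `sign_rs256` is called.

```python
import base64
import json
import time
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url without "=" padding, as JWS compact serialization requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _json_b64(obj: dict) -> str:
    return b64url(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def jwt_signing_input(claims: dict, ttl_seconds: int = 3600, now: Optional[int] = None) -> str:
    """Build "<header>.<payload>" with iat/exp so the token is valid for ttl_seconds."""
    issued_at = int(time.time()) if now is None else now
    header = {"alg": "RS256", "typ": "JWT"}
    payload = {**claims, "iat": issued_at, "exp": issued_at + ttl_seconds}
    return f"{_json_b64(header)}.{_json_b64(payload)}"


def sign_rs256(signing_input: str, private_key_pem: bytes) -> str:
    """Append an RS256 (RSASSA-PKCS1-v1_5 with SHA-256) signature to the signing input."""
    # Third-party dependency: pip install cryptography
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    key = serialization.load_pem_private_key(private_key_pem, password=None)
    signature = key.sign(signing_input.encode("ascii"), padding.PKCS1v15(), hashes.SHA256())
    return f"{signing_input}.{b64url(signature)}"


# Usage (not executed here):
#   pem = open("private.pem", "rb").read()
#   token = sign_rs256(jwt_signing_input({"sub": "demo-user"}), pem)
```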
Header Authentication
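The exact header the gateway reads is not stated here. Assuming it accepts the standard `Authorization: Bearer <JWT>` form, an API client only needs to attach the token to each request. For comparison, the second helper shows the ticket form used when a page is opened directly in a browser, as described in the next example. The ticket value would come from the ticket-issuing step, which is not shown. Both helper names are hypothetical.

```python
import urllib.request
from urllib.parse import urlencode


def authed_request(url: str, jwt: str) -> urllib.request.Request:
    """Programmatic clients carry the JWT in a request header."""
    # Assumption: the gateway expects "Authorization: Bearer <JWT>".
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {jwt}"})


def vnc_url_with_ticket(base_url: str, ticket: str) -> str:
    """A page opened directly in the browser cannot set headers, so the ticket goes in the query."""
    return f"{base_url}/vnc/index.html?{urlencode({'ticket': ticket})}"
```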
Short-Lived Ticket Authentication Example (VNC page access): when a page is opened directly, authentication cannot be passed in a header, so the ticket can only be supplied as the ?ticket= query parameter.
The ticket's validity period is set by the TICKET_TTL_SECONDS environment variable.
Use the ${ticket} variable to build the VNC URL and initiate access.

In AIO, service processes (supervisord) and service routing (Nginx) are mounted automatically from convention-based directories:
/opt/gem/supervisord/*.conf
/opt/gem/nginx/*.conf
To customize services and routing on top of the AIO image, refer to the following image code:
Fern is used to generate Python / Go / Node.js SDKs directly from the AIO Sandbox API documentation. Taking Python as an example, a few lines of code are enough to access AIO Sandbox's core functionality:
More usage examples: agent-infra/sandbox#examples
Just add 4 lines of code to integrate the community's browser-use:

Complete code: browser-use#main.py

Complete code: langgraph-deepagents#main.py
You can use API / SDK to compose high-level toolsets needed by Agents, for example link_reader returns page content for a URL:
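The sketch below is a hedged stand-in for link_reader. It fetches a URL with urllib and reduces the HTML to its visible text using the standard-library HTMLParser, skipping `<script>` and `<style>`. A sandbox-backed version would instead load the page in the sandbox browser through the API / SDK, so that JavaScript-rendered content is included; those SDK calls are not reproduced here.

```python
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def link_reader(url: str, timeout: float = 10.0) -> str:
    """Hypothetical link_reader tool: fetch a URL and return its readable text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return html_to_text(resp.read().decode(charset, errors="replace"))
```

Packaging fetch-and-clean as one tool means the Agent gets compact text in a single call, instead of spending several browser actions and tokens on raw HTML.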
Currently, the most suitable form of public cloud deployment is function compute, which relies on the function service's ability to access a designated sandbox instance: One-Click Deploy All-in-One Sandbox Application--Function Service-Volcano Engine

AIO Sandbox provides an integrated, customizable base environment (Agent Env), enabling Agents to complete diverse tasks including browsing, executing code, running commands, and file operations within the same environment, while supporting customization of domain-specific sandboxes for different Agents. This sandbox system will continue to evolve and expand alongside the rising intelligence ceiling of Agents and the creativity of developers.
Going forward, we will keep improving stability, observability, and ecosystem integration, and continue refining evaluation systems and best practices. The goal is robust deployment and efficient operation of AIO Sandbox in more large-scale, high-demand Agent application scenarios.

| Term | Explanation |
|---|---|
| Agent | In the LLM context, an AI Agent is an intelligent entity that can autonomously understand intent, plan decisions, and execute complex tasks. An Agent is not an upgraded version of ChatGPT; it doesn't just tell you "how to do it," but actually helps you do it. If Copilot is the co-pilot, then Agent is the main driver. Similar to the human process of "doing things," an Agent's core functions can be summarized as a loop of three steps: Perception, Planning, and Action. |
| Copilot | Copilot refers to an AI-based assistance tool, typically integrated with specific software or applications, designed to help users improve work efficiency. Copilot systems analyze user behavior, inputs, data, and history to provide real-time suggestions, automate tasks, or enhance functionality, helping users make decisions or simplify operations. |
| AIO | All-In-One, refers to integrating multiple capabilities (Browser, Code Execution, Shell, File, visual takeover, authentication, proxy, etc.) within a single image/instance, reducing cross-environment switching and data transfer. |
| Sandbox | A controlled, isolated execution environment. Used to run browsers, code, or command lines, controlling resources and permissions, reducing impact and risk to the host system. |
| CDP | CDP (Chrome DevTools Protocol) is a protocol for communicating with Chrome or Chromium browsers. It allows developers to interact with browsers by sending commands and receiving events for debugging, analysis, and automated browser operations. CDP provides a set of APIs (Application Programming Interface) defining browser behavior and functionality. |
| VNC | VNC is a suite of "remote desktop sharing/control" technologies and tools based on the RFB (Remote Framebuffer) protocol. Core idea: encode the remote host's screen framebuffer (pixels) and transmit over network to the client, while replaying client keyboard and mouse events to the remote host, enabling cross-platform remote operation. |
| MCP | Model Context Protocol is an open protocol that standardizes how applications provide context to LLMs. Think of MCP as the USB-C port for AI applications. Just like USB-C provides a standard way for your devices to connect to various peripherals and accessories, MCP provides a standard way for your AI models to connect to different data sources and tools. |
| Browser Use | General term for Agents completing tasks like search, login, clicking, form filling, downloading through browsers, either via CDP commands or GUI visual operations. |
| OpenHands | OpenHands is an open-source AI Software Developer Agent Platform for training, evaluating, and running large language models (LLM) that can "autonomously program" in real development environments. It was initially released as OpenDevin, later renamed to OpenHands, maintained by the All Hands AI community. |