October 29, 2025

AIO Sandbox: An Integrated and Customizable Sandbox Environment for AI Agents

Background

As LLMs continue to evolve, AI application forms have undergone three generational shifts:

  • Chatbot: Conversational interaction, answering questions
  • Copilot: Collaborative assistance, improving efficiency
  • Agent: Autonomous execution, completing tasks

Agents can autonomously perceive their environment, plan steps, and invoke tools, enabling them to operate computers like humans: automatically browsing web pages to collect information, generating and running code to analyze data, executing system commands to manage files, and even completing complex multi-step operations through visual interfaces. This capability allows Agent deliverables to approach or even exceed human professional standards.

Pain Points

  1. 🧩 Environment Fragmentation: Multiple single-function sandboxes (such as E2B for code execution and Browserbase for browsers) force Agents to move data between sandboxes via NAS/OSS, which increases latency and complexity. For example, a deep research Agent handling "convert a paper into a PPT" has to exchange dozens of intermediate files (JSON configs, chart images, preview screenshots, etc.) across multiple sandboxes, adding complexity and overhead to the entire Agent system.

Different functional sandboxes sharing and collaborating

  2. 🎁 Difficult Customization: Different types of Agents require different pre-installed tech stacks. Traditional sandboxes provide a single uniform pre-installed environment, which cannot meet every Agent's specific needs.

Different Agents have different pre-installed packages in sandbox environments

  3. 🔒 Security Isolation Challenges: Agents need real system capabilities (network, files, browser, GPU), yet the sandbox must maintain strong isolation to prevent unauthorized access and data leaks.
  4. 🖥️ Difficult Visual Interaction: Complex Agent tasks sometimes require human takeover, so functional sandboxes need to integrate VNC, Terminal, and VSCode to provide a consistent experience, along with resolution switching, screenshots, and GUI visual operations.
  5. 🌐 Browser Environment Complexity: Anti-automation and fingerprint-based risk control, CDP instability, inadequate support for proxies with username/password authentication, and missing GUI operations.

A well-configured computer can significantly improve human work efficiency; similarly, a powerful sandbox environment can also improve Agent task quality and execution speed.

Introduction

One-sentence introduction: AIO Sandbox integrates browser, code execution, terminal, visual takeover, forward and reverse proxy, MCP, authentication and other basic functions in a single sandbox, allowing environment customization based on needs, enabling different Agents to "complete tasks more efficiently in a unified environment container".

AIO (All-in-One) Sandbox

Features

  • 📦 Out-of-the-box: Connect directly to sandbox capabilities via the /mcp endpoint (MCP protocol); an API / SDK is also provided for customizing sandbox toolsets.
  • 🚀 Second-level Startup: Full sandbox service startup completes in seconds on a cold start, and reaches millisecond level after pre-caching.
  • 🌈 Customizable: Agents in various vertical scenarios need domain-specific tools and dependencies; AIO provides a unified image base, supporting on-demand expansion with convention-based routing and service configuration.
  • 🌐 Browser: Integrates Web Infra's RS lightweight kernel, providing CDP, screenshots, pure visual GUI operations, and Proxy configuration.
  • 🔄 Human Takeover: Provides browser VNC, Code Server, Terminal, supporting human takeover and debugging mid-task.
  • 📡 Proxy and Forwarding: Supports forward proxy with authentication; maps {port}-{domain} wildcard domains or /proxy|/absproxy/{port} paths to services inside the sandbox (convenient for preview/demo).
  • 🔒 Security Authentication: JWT Bearer access control; provides Short-Lived Tickets for links that cannot carry Headers.

Examples

| Instruction | Replay | Screenshot |
| --- | --- | --- |
| Help me design an interesting website to introduce sauropod dinosaurs from the Jurassic and Cretaceous periods for elementary school children. I want the website to be cartoon-styled. | Replay | |
| Search for news about ByteDance's Seed 1.6 model, then write a modern-styled webpage and deploy it | Replay | |
| Based on this OSWorld image, please search for the latest information on the internet and design a modern website for it. | Replay | |
| Play Poki 2048 game | Replay | |

More at: https://seed-tars.com/showcase/ui-tars-2

Quick Start

Cloud

One-Click Deploy All-in-One Sandbox Application--Function Service-Volcano Engine

Local

Prerequisites: Install Docker, then start locally with one command:

docker run --rm -it -p 8080:8080 ghcr.io/agent-infra/sandbox:latest

# For faster access in China
# docker run --security-opt seccomp=unconfined --rm -it -p 8080:8080 enterprise-public-cn-beijing.cr.volces.com/vefaas-public/all-in-one-sandbox:latest

System Architecture

Overview

AIO Sandbox provides Agents with basic capabilities like Browser, File, Shell, Code, and offers extensibility to support developers in combining and customizing dedicated sandboxes based on Agent needs (such as AIO Sandbox for Mobile/Medical/Legal/Finance/Scientific Research). Sandbox customization levels increase progressively:

  1. Standard (Out-of-the-box): Plug-and-play for Agents via /mcp endpoint, suitable for quick PoC Agent validation.
  2. Custom Toolset (Tool/Skills Extension): Without modifying the image, add or orchestrate tools based on SDK/API (such as adding web_search); also extend Skills to implement automated handling of specific sandbox tasks.
  3. Custom Image: Starting FROM the aio.sandbox base image, install specific dependencies (such as multimedia/image processing) and mount custom services (e.g., /custom_tools/ocr for image recognition).

Sandbox Extensible Architecture

Core Components

AIO Sandbox Component Diagram

Browser

The Browser component is the browser environment for Agents. Its core is providing CDP and VNC, so mainstream Browser Use frameworks can be used directly. AIO also provides an X11-based interface for GUI visual operations on the browser, which can be combined with CDP for Browser Use solutions that are more efficient and less likely to trigger risk control.

AIO Sandbox Browser Architecture

CDP

CDP (Chrome DevTools Protocol) is a protocol for communicating with Chrome or Chromium browsers. It provides browser control APIs over WebSocket for navigation and loading, DOM manipulation, JS execution/debugging, network interception and simulation, screenshots and rendering, security and permissions, etc. For a more intuitive understanding, here is an example of using CDP to issue a page navigation command. First, start Chrome with remote debugging enabled:

'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' \
    --disable-gpu \
    --user-data-dir=./test \
    --remote-debugging-port=9222 \
    https://www.chromestatus.com

Visit http://localhost:9222/json/version, where webSocketDebuggerUrl is the CDP address:

$ curl http://localhost:9222/json/version
{
   "Browser": "Chrome/141.0.7390.66",
   "Protocol-Version": "1.3",
   "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36",
   "V8-Version": "14.1.146.11",
   "WebKit-Version": "537.36 (@95681a3c3d516c397b75ff45b8980c1088666775)",
   "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/a6c5f19f-5d24-4bed-ba08-9c15cf5aeedb"
}

After establishing a WebSocket connection with CDP, you can execute browser commands:
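As a minimal sketch of that step, the following Python builds a CDP command as a JSON text frame and sends it over the WebSocket. It assumes the third-party websockets package is installed and that ws_url is the webSocketDebuggerUrl returned above; the helper names cdp_command and open_url are illustrative, not part of AIO Sandbox. Because the URL above is the browser-level endpoint, the sketch uses Target.createTarget, which opens a new tab at the given URL without first attaching to a page session.

```python
import asyncio
import itertools
import json
from typing import Optional

_next_id = itertools.count(1)  # CDP replies echo the request id


def cdp_command(method: str, params: Optional[dict] = None) -> str:
    """Serialize one CDP command as the JSON text frame sent over the WebSocket."""
    return json.dumps({"id": next(_next_id), "method": method, "params": params or {}})


async def open_url(ws_url: str, url: str) -> dict:
    """Open `url` in a new tab via the browser-level CDP endpoint and return the reply."""
    import websockets  # third-party (pip install websockets); assumed to be available

    request = cdp_command("Target.createTarget", {"url": url})
    request_id = json.loads(request)["id"]
    async with websockets.connect(ws_url) as ws:
        await ws.send(request)
        while True:
            # Events can arrive between request and reply, so match on the id.
            reply = json.loads(await ws.recv())
            if reply.get("id") == request_id:
                return reply


# Usage (needs a running Chrome started as shown above; copy the id from /json/version):
# asyncio.run(open_url("ws://localhost:9222/devtools/browser/<id>", "https://example.com"))
```

Page-level commands such as Page.navigate follow the same message shape, but are sent to a page target's own WebSocket URL or through an attached session.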

Note: AIO Sandbox doesn't directly expose the CDP interface /json/version, but relays CDP through the uvicorn service and adds heartbeat detection to avoid ws disconnection issues.

GUI Visual Operations

Screenshots: Unlike CDP-based screenshots, visual screenshots from /v1/browser/screenshot include Tabs (the entire browser window), and operations target the entire browser window.

GUI Browser Screenshot (Tabs) | CDP-based Page Screenshot (Page)

Unlike CDP browser operations, visual operations via /v1/browser/actions simulate human behavior such as clicking, typing, and scrolling, which reduces the chance of triggering the target website's risk control.

Unified Action Space: GUI operations are abstracted into composable, minimal atomic actions, such as moving the mouse, clicking, dragging, scrolling, pressing keys, and entering text, plus utility functions like wait. The action space is aligned as closely as possible with the actions that VLM visual models actually execute.

| action_type | Description | Required Parameters | Optional Parameters |
| --- | --- | --- | --- |
| MOVE_TO | Move mouse to specified position | x, y | - |
| MOVE_REL | Move relative to current mouse position | x_offset, y_offset | - |
| CLICK | Click operation | - | x, y, button, num_clicks |
| MOUSE_DOWN | Press mouse button | - | button |
| MOUSE_UP | Release mouse button | - | button |
| RIGHT_CLICK | Right click | - | x, y |
| DOUBLE_CLICK | Double click | - | x, y |
| DRAG_TO | Drag to specified position | x, y | - |
| DRAG_REL | Drag relative to current mouse position | x_offset, y_offset | - |
| SCROLL | Scroll operation | - | dx, dy |
| TYPING | Type text | text | - |
| PRESS | Press key | key | - |
| KEY_DOWN | Press keyboard key | key | - |
| KEY_UP | Release keyboard key | key | - |
| HOTKEY | Key combination | keys (array), e.g.: ["ctrl", "c"] | - |
| WAIT | Wait | duration (time in seconds) | - |
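
The action space can be exercised with a small payload builder. The sketch below checks the required parameters listed in the table before an action is sent. The helper names build_action and send_action are hypothetical, and it is an assumption that /v1/browser/actions accepts one action object as a JSON POST body; consult the API documentation for the exact request shape.

```python
import json
import urllib.request

# Required parameters per action_type, copied from the table above.
REQUIRED_PARAMS = {
    "MOVE_TO": ("x", "y"),
    "MOVE_REL": ("x_offset", "y_offset"),
    "CLICK": (),
    "MOUSE_DOWN": (),
    "MOUSE_UP": (),
    "RIGHT_CLICK": (),
    "DOUBLE_CLICK": (),
    "DRAG_TO": ("x", "y"),
    "DRAG_REL": ("x_offset", "y_offset"),
    "SCROLL": (),
    "TYPING": ("text",),
    "PRESS": ("key",),
    "KEY_DOWN": ("key",),
    "KEY_UP": ("key",),
    "HOTKEY": ("keys",),
    "WAIT": ("duration",),
}


def build_action(action_type: str, **params) -> dict:
    """Return an action payload, rejecting unknown types and missing required parameters."""
    if action_type not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action_type: {action_type}")
    missing = [name for name in REQUIRED_PARAMS[action_type] if name not in params]
    if missing:
        raise ValueError(f"{action_type} is missing required parameters: {missing}")
    return {"action_type": action_type, **params}


def send_action(base_url: str, action: dict) -> bytes:
    """POST one action as JSON (assumed request shape) and return the raw response body."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/browser/actions",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Usage (needs a running sandbox):
# send_action("http://localhost:8080", build_action("CLICK", x=200, y=120))
```

Validating on the client side keeps malformed actions from a VLM's output out of the sandbox and gives the model a precise error to recover from.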
Takeover

When Browser Use encounters login requirements, human takeover is generally needed, requiring an interactive browser interface. Currently there are two approaches:

  1. VNC Takeover: AIO Sandbox provides the /vnc/index.html page for direct user interaction.

  2. Canvas + CDP Takeover: The frontend connects via CDP and redraws the complete browser interface on a Canvas in real time (Playground). We've packaged the frontend part into a component, @agent-infra/browser-ui. Below, the left side is the actual browser and the right side is the browser-ui screen mirror:

The differences between the two takeover methods are roughly as follows:

| Comparison Dimension | VNC | Canvas + CDP (Chrome DevTools Protocol) |
| --- | --- | --- |
| Technical Principle | Remote desktop protocol, transmits entire screen pixels | Controls browser via CDP, Canvas renders content |
| Transport Protocol | RFB (Remote Framebuffer) | WebSocket + CDP |
| Transport Content | Complete browser view (with Tabs) | Only browser current page content (no Tabs by default, can be implemented separately) |
| Bandwidth Usage | High (10-50 Mbps) | Low (1-5 Mbps) |
| Latency | Higher (50-200ms) | Lower (10-50ms) |
| Stability | Not easily disconnected | Easily disconnected, needs manual heartbeat with CDP to avoid disconnection |
| CPU Usage | High (desktop encoding) | Low (browser rendering only) |
| Memory Usage | High (needs complete desktop environment) | Low (browser process only) |
| Control Range | Entire browser | Browser internal pages only |
| Automation Capability | Basic (mouse keyboard simulation) | Powerful (DOM operations, network interception, JS injection, etc.) |
| Multi-window Support | ✅ Supported | ❌ Single browser window only |
| File Operations | ✅ Can operate local files | ❌ Limited by browser sandbox |

Command Line Interpreter

For Coding Agents, most tasks can be completed through command line execution. The Shell module uses OpenHands' CmdRunAction as its execution engine and combines it with tmux to support multi-session execution.

File Operations

File/code editing only requires two tools:

  • File CRUD: Encapsulates basic I/O for file read/write/list directory/create/upload/download, with path validation and permission control, covering common file operation scenarios.

  • Text Editor: Implements model-oriented fine-grained editing tool str_replace_editor, supporting:

    • view (view file or directory, including line range)
    • str_replace (exact string replacement)
    • insert (insert by line, legacy version support)
    • undo_edit (undo)
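
To make the four editing commands concrete, here is a minimal in-memory sketch of their semantics. It is an illustration only, not AIO Sandbox's implementation: the class name TextEditor is hypothetical, and the rule that str_replace must match exactly once is an assumption modeled on how model-oriented editors commonly avoid ambiguous edits.

```python
class TextEditor:
    """In-memory stand-in for a file edited through str_replace_editor-style commands."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._history = []  # previous versions, newest last, used by undo_edit

    def view(self, start: int = 1, end: int = None) -> str:
        """Return lines start..end (1-based, inclusive), each prefixed with its number."""
        lines = self.text.splitlines()
        stop = len(lines) if end is None else end
        return "\n".join(f"{n}\t{lines[n - 1]}" for n in range(start, stop + 1))

    def str_replace(self, old: str, new: str) -> None:
        """Replace `old` with `new`; refuse unless `old` occurs exactly once (assumed rule)."""
        occurrences = self.text.count(old)
        if occurrences != 1:
            raise ValueError(f"expected exactly 1 match, found {occurrences}")
        self._history.append(self.text)
        self.text = self.text.replace(old, new, 1)

    def insert(self, after_line: int, new_text: str) -> None:
        """Insert `new_text` as a new line after line `after_line` (0 inserts at the top)."""
        lines = self.text.splitlines(keepends=True)
        if not 0 <= after_line <= len(lines):
            raise ValueError("line number out of range")
        self._history.append(self.text)
        if lines and not lines[-1].endswith("\n"):
            lines[-1] += "\n"  # keep the inserted line from fusing with the last line
        lines.insert(after_line, new_text.rstrip("\n") + "\n")
        self.text = "".join(lines)

    def undo_edit(self) -> None:
        """Restore the version that existed before the most recent edit."""
        if not self._history:
            raise ValueError("nothing to undo")
        self.text = self._history.pop()
```

Requiring a unique match makes a failed edit loud rather than silently changing the wrong occurrence, which gives the model a clear signal to include more surrounding context.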

Code Execution

To balance language coverage against image size, AIO uses the Python 3.10/3.11/3.12 and Node.js 22 runtimes from Sandbox Fusion, providing an integrated, securely isolated environment for code execution.

MCP Servers Aggregator

Aggregates multiple MCP Servers (e.g., chrome-devtools-mcp) through unified entry point /mcp, supporting parameter-level filtering, and allowing tool name prefixing (namespacing).

/mcp supports MCP Servers filtering

MCP Servers can currently be filtered by search; future expansion will add multi-dimensional filtering by tags and category to reduce redundant calls and lower model token costs.
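
The idea of aggregation with filtering and namespacing can be sketched in a few lines. This is a conceptual illustration on plain data, not the /mcp implementation: the function name aggregate_tools, the underscore separator, the substring match on server names, and the server and tool names in the usage example are all assumptions.

```python
def aggregate_tools(servers: dict, search: str = "", separator: str = "_") -> list:
    """Merge per-server tool lists into one namespaced list.

    servers: mapping of MCP server name -> list of tool names.
    search:  keep only servers whose name contains this text (case-insensitive).
    """
    needle = search.lower()
    merged = []
    for server_name, tool_names in servers.items():
        if needle and needle not in server_name.lower():
            continue  # filtered out, so its tools never reach the model's context
        # Prefix each tool with its server name so identical tool names cannot collide.
        merged.extend(f"{server_name}{separator}{tool}" for tool in tool_names)
    return merged


# Hypothetical server and tool names, for illustration only.
servers = {"browser": ["navigate", "screenshot"], "shell": ["exec"]}
print(aggregate_tools(servers))
print(aggregate_tools(servers, search="shell"))
```

Filtering before the tool list is handed to the model is what saves tokens: every tool that is dropped is one fewer schema in the prompt.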

Proxy

In Agent sandboxes, there are generally two types of scenarios corresponding to forward and reverse proxies:

  1. Forward Proxy: Browser Use Agent can access private/global networks

  2. Reverse Proxy: Coding Agent services developed inside the sandbox are exposed externally for user-side preview

Forward Proxy

AIO uses the TinyProxy proxy server to bypass geographic restrictions, access restricted content, or provide secure access within corporate intranets.

AIO Sandbox Forward Proxy Principle

Why introduce TinyProxy when Chrome already has --proxy-server for specifying a proxy? The official Chromium documentation states that Chrome will not use any username/password embedded in the proxy settings (e.g., http://user:pass@host:port); authentication must go through a separate challenge dialog, which disrupts the entire Browser Use experience (as shown below):

Proxy with username and password triggers dialog

Reverse Proxy

AIO Sandbox Reverse Proxy Principle

AIO Sandbox provides two methods to access service ports inside the Sandbox:

  1. Subdomain wildcard forwarding (recommended): Any domain matching the ${port}-${domain} format is forwarded to the corresponding port inside the sandbox.

  2. Subpath forwarding: Requests under /proxy/${port} or /absproxy/${port} are forwarded to the port, but this method runs into many issues: for routing-sensitive services (like frontend projects), the additional path prefix causes resources to fail to match and return 404.
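
The difference between the two methods shows up as soon as a page references a root-relative asset. The sketch below builds both URL forms and uses the standard library's urljoin to resolve an asset path the way a browser would. The helper names and the example domain are hypothetical.

```python
from urllib.parse import urljoin


def subdomain_url(port: int, domain: str, scheme: str = "https") -> str:
    """Preview URL in the ${port}-${domain} wildcard form."""
    return f"{scheme}://{port}-{domain}/"


def subpath_url(base_url: str, port: int, absolute: bool = False) -> str:
    """Preview URL in the /proxy/${port} (or /absproxy/${port}) subpath form."""
    prefix = "absproxy" if absolute else "proxy"
    return f"{base_url.rstrip('/')}/{prefix}/{port}/"


# A frontend page that references "/assets/app.js" (root-relative path).
via_subdomain = urljoin(subdomain_url(3000, "sandbox.example.com"), "/assets/app.js")
via_subpath = urljoin(subpath_url("http://localhost:8080", 3000), "/assets/app.js")

print(via_subdomain)  # https://3000-sandbox.example.com/assets/app.js
print(via_subpath)    # http://localhost:8080/assets/app.js
```

With the subpath form, the resolved asset URL loses the /proxy/3000 prefix, so the request no longer reaches the service on port 3000. That is the 404 problem described above, and the reason subdomain forwarding is recommended.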

Authentication

Agent operations in the sandbox generate user data, so access must be authenticated. To add unified authentication to AIO Sandbox non-intrusively, without modifying any existing business routing configuration and without adding mental overhead when routes are extended later, we designed an "asymmetric encryption + JWT" reverse proxy architecture at the internal Nginx gateway layer:

How to Enable (One-time Configuration)

  • Generate key pair
openssl genrsa -out private_key.pem 2048
openssl rsa -in private_key.pem -pubout -out public_key.pem
echo "Key pair generation complete!"
  • Start service (with public key to enable authentication), using environment variable JWT_PUBLIC_KEY
export JWT_PUBLIC_KEY=$(cat public_key.pem | base64)
# Pass the variable to the sandbox when starting it, e.g. by adding this flag to the docker run command from Quick Start:
#   -e JWT_PUBLIC_KEY="${JWT_PUBLIC_KEY}"

Issue JWT

The business service uses the private key to generate a JWT valid for 1 hour. Below is a simplified script for generating the JWT; in practice, the business backend should use a mature JWT library:

# This is a simplified script to generate JWT, in practice business backend should use mature JWT libraries
base64url_encode() { openssl base64 -e -A | tr '+/' '-_' | tr -d '='; }
header='{"alg":"RS256","typ":"JWT"}'
exp_time=$(($(date +%s) + 3600))
payload="{\"exp\":${exp_time}}"
to_be_signed="$(echo -n "$header" | base64url_encode).$(echo -n "$payload" | base64url_encode)"
signature=$(echo -n "$to_be_signed" | openssl dgst -sha256 -sign private_key.pem | base64url_encode)
jwt="${to_be_signed}.${signature}"
echo "JWT generated: ${jwt}"
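
For a backend written in Python, the same token can be issued with a JWT library instead of hand-rolled signing. The following is a minimal sketch assuming the third-party PyJWT package with its crypto extra is installed; the function names build_claims and issue_jwt are illustrative, and the claim set mirrors the script above (only exp).

```python
import time
from typing import Optional


def build_claims(ttl_seconds: int = 3600, now: Optional[float] = None) -> dict:
    """Return the JWT payload: an expiry `ttl_seconds` after `now` (default: current time)."""
    issued_at = int(time.time()) if now is None else int(now)
    return {"exp": issued_at + ttl_seconds}


def issue_jwt(private_key_pem: str, ttl_seconds: int = 3600) -> str:
    """Sign the claims with RS256 using the PEM private key generated earlier."""
    import jwt  # PyJWT (pip install "pyjwt[crypto]"); assumed to be available

    return jwt.encode(build_claims(ttl_seconds), private_key_pem, algorithm="RS256")


# Usage (needs private_key.pem from the key generation step):
# with open("private_key.pem") as key_file:
#     token = issue_jwt(key_file.read())
```

Keeping the claim construction separate from signing makes the expiry logic easy to test without a key.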

Usage

  1. Header Authentication

    curl --silent -X GET "http://localhost:8080/v1/sandbox" \
         -H "Authorization: Bearer ${jwt}"
  2. Short-Lived Ticket Authentication (example: accessing the VNC page): A page opened directly in the browser cannot authenticate via a Header, so the ticket must be passed as the ?ticket= query parameter.

    • Use JWT to obtain ticket from common endpoint (default validity is 30s, can be configured via TICKET_TTL_SECONDS environment variable)
    echo "Using JWT to exchange for common one-time ticket..."
    
    ticket_response=$(curl --silent -X POST "http://localhost:8080/tickets" \
         -H "Authorization: Bearer ${jwt}")
    
    ticket=$(echo "$ticket_response" | jq -r .ticket)
    expires=$(echo "$ticket_response" | jq -r .expires_in)
    
    echo "Successfully obtained! Ticket: ${ticket}, Validity: ${expires} seconds"
    • Client builds and uses VNC URL: Now you can use the obtained ${ticket} variable to build the VNC URL and initiate access.
    # Bash script simulates client URL concatenation
    vnc_url="http://localhost:8080/vnc/index.html?ticket=${ticket}&path=websockify%3Fticket%3D${ticket}"
    
    echo "Client-built final URL: ${vnc_url}"
    
    # Simulate access (should be done in browser)
    # curl -I "${vnc_url}"
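
The same ticket flow can be sketched in Python using only the standard library. The endpoint, the response fields (ticket, expires_in), and the URL layout are taken from the shell script above; the function names request_ticket and build_vnc_url are illustrative.

```python
import json
import urllib.request
from urllib.parse import urlencode


def request_ticket(base_url: str, jwt_token: str) -> dict:
    """Exchange a JWT for a short-lived ticket via POST /tickets."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/tickets",
        data=b"",
        headers={"Authorization": f"Bearer {jwt_token}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # fields used by the script above: ticket, expires_in


def build_vnc_url(base_url: str, ticket: str) -> str:
    """Build the VNC page URL; the ticket appears twice, once nested inside `path`."""
    query = urlencode({"ticket": ticket, "path": f"websockify?ticket={ticket}"})
    return f"{base_url.rstrip('/')}/vnc/index.html?{query}"


print(build_vnc_url("http://localhost:8080", "abc"))
# http://localhost:8080/vnc/index.html?ticket=abc&path=websockify%3Fticket%3Dabc
```

Using urlencode rather than string concatenation ensures the nested websockify query is percent-encoded correctly even if a ticket contains reserved characters.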

Extension and Ecosystem

Custom Images

In AIO, service processes (supervisord) and service routing (Nginx) are automatically mounted following convention-based directories:

  • Service process directory: /opt/gem/supervisord/*.conf
  • Routing directory: /opt/gem/nginx/*.conf

To customize services and routing on top of the AIO image, refer to the following image code:

FROM enterprise-public-cn-beijing.cr.volces.com/vefaas-public/all-in-one-sandbox:latest

# ----------------------
# Install additional system dependencies (if any)
# installed path: /usr/bin/*
# ----------------------
RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends \
        ${your_system_dep}; \
    # clean up
    apt-get clean; \
    rm -rf /var/lib/apt/lists/*

# ----------------------
# npm install (if any)
#
# ----------------------
RUN npm i -g ${your_npm_package}

# ----------------------
# python pip install (if any)
# installed path: /usr/local/bin/*
# ----------------------
RUN pip install ${your_python_package}

# Add custom Server service (placed in the supervisord convention directory listed above)
COPY ./supervisord.agent_server.conf /opt/gem/supervisord/agent_server.conf
# Bind Nginx routing
COPY ./nginx.agent_server.conf /opt/gem/nginx/nginx.agent_server.conf

# # If you don't need services in AIO, you can delete them, e.g., Code Server
# ## Delete Code Server process
# RUN rm -rf /opt/gem/supervisord/supervisord.code_server.conf
# ## Delete Code Server routing
# RUN rm -rf /opt/gem/nginx/code_server.conf

SDK Integration

AIO Sandbox uses Fern to convert its API documentation directly into Python / Go / Node.js SDKs. Taking Python as an example, a few lines of code are enough to access AIO Sandbox's core functionality:

from agent_sandbox import Sandbox

client = Sandbox(base_url="http://localhost:8080")

# Execute Shell
shell_res = client.shell.exec_command(command="ls -la")
print(shell_res.data.output) # /home/gem

# Browser Screenshot
screenshot = client.browser.screenshot()
print(screenshot)

# Get Browser CDP
browser_info = client.browser.get_browser_info()
cdp_url = browser_info.data.cdp_url # ws://

# Read File
file_res = client.file.read_file(file="/home/gem/.bashrc")
print(file_res.data.content)

More usage examples: agent-infra/sandbox#examples

browser-use

Just add 4 lines of code to integrate the community's browser-use:

Complete code: browser-use#main.py

LangGraph-DeepAgents

Complete code: langgraph-deepagents#main.py

Custom Toolsets

You can use API / SDK to compose high-level toolsets needed by Agents, for example link_reader returns page content for a URL:

import json

from openai import OpenAI
from playwright.async_api import async_playwright  # needed by link_reader below
from agent_sandbox import Sandbox

client = OpenAI(
    api_key="your_api_key",
)
sandbox = Sandbox(base_url="http://localhost:8080")

tools = [{
    "type": "function",
    "function": {
        "name": "link_reader",
        "description": "Render and read webpage, return title, body text, and final URL (based on CDP).",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "format": "uri"},
                "timeout_ms": {"type": "integer", "default": 30000}
            },
            "required": ["url"]
        }
    }
}]

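async def link_reader(url: str, timeout_ms: int = 30_000) -> dict:
    # Connect to the sandbox browser over CDP (same access path as the SDK example: .data.cdp_url)
    cdp_url = sandbox.browser.get_browser_info().data.cdp_url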
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(cdp_url)
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            title = await page.title()
            text = await page.evaluate("document.body.innerText || ''")
            return {"final_url": page.url, "title": title, "text": text[:8000]}
        finally:
            await browser.close()

Deployment

Currently, the best public cloud deployment option is function compute, which relies on the ability to route requests to a designated sandbox instance: One-Click Deploy All-in-One Sandbox Application--Function Service-Volcano Engine

Summary and Outlook

AIO Sandbox provides an integrated, customizable base environment (Agent Env), enabling Agents to complete diverse tasks including browsing, executing code, running commands, and file operations within the same environment, while supporting customization of domain-specific sandboxes for different Agents. This sandbox system will continue to evolve and expand alongside the rising intelligence ceiling of Agents and the creativity of developers.

Going forward, we will continue to refine stability, observability, and ecosystem integration, continuously improve evaluation systems and best practices, driving robust deployment and efficient operation of AIO Sandbox in more large-scale, high-demand Agent application scenarios.

Appendix

Terminology

| Term | Explanation |
| --- | --- |
| Agent | In the LLM context, an AI Agent is an intelligent entity that can autonomously understand intent, plan decisions, and execute complex tasks. An Agent is not an upgraded version of ChatGPT; it doesn't just tell you "how to do it," but actually helps you do it. If Copilot is the co-pilot, then Agent is the main driver. Similar to the human process of "doing things," an Agent's core functions can be summarized as a loop of three steps: Perception, Planning, and Action. |
| Copilot | Copilot refers to an AI-based assistance tool, typically integrated with specific software or applications, designed to help users improve work efficiency. Copilot systems analyze user behavior, inputs, data, and history to provide real-time suggestions, automate tasks, or enhance functionality, helping users make decisions or simplify operations. |
| AIO | All-In-One, refers to integrating multiple capabilities (Browser, Code Execution, Shell, File, visual takeover, authentication, proxy, etc.) within a single image/instance, reducing cross-environment switching and data transfer. |
| Sandbox | A controlled, isolated execution environment. Used to run browsers, code, or command lines, controlling resources and permissions, reducing impact and risk to the host system. |
| CDP | CDP (Chrome DevTools Protocol) is a protocol for communicating with Chrome or Chromium browsers. It allows developers to interact with browsers by sending commands and receiving events for debugging, analysis, and automated browser operations. CDP provides a set of APIs (Application Programming Interfaces) defining browser behavior and functionality. |
| VNC | VNC is a suite of "remote desktop sharing/control" technologies and tools based on the RFB (Remote Framebuffer) protocol. Core idea: encode the remote host's screen framebuffer (pixels) and transmit it over the network to the client, while replaying client keyboard and mouse events on the remote host, enabling cross-platform remote operation. |
| MCP | Model Context Protocol is an open protocol that standardizes how applications provide context to LLMs. Think of MCP as the USB-C port for AI applications. Just like USB-C provides a standard way for your devices to connect to various peripherals and accessories, MCP provides a standard way for your AI models to connect to different data sources and tools. |
| Browser Use | General term for Agents completing tasks like search, login, clicking, form filling, and downloading through browsers, either via CDP commands or GUI visual operations. |
| OpenHands | OpenHands is an open-source AI Software Developer Agent Platform for training, evaluating, and running large language models (LLMs) that can "autonomously program" in real development environments. It was initially released as OpenDevin, later renamed to OpenHands, and is maintained by the All Hands AI community. |
