OS-Level Control

OS-level control provides direct access to operating system primitives like mouse movements, keyboard input, and screen interactions within your browser sessions. This approach offers more precise control than traditional web automation methods and is particularly powerful when combined with AI agents and vision-based models.

Why OS-Level Control?

Superior AI Agent Performance

Vision-based AI models perform significantly better when they can interact with the browser using the same primitives humans use:

  • Keyboard shortcuts work naturally: Ctrl+F for searching, Ctrl+L for browser navbar interaction, Ctrl+T for new tabs
  • OS-level UI elements: Dropdowns, context menus, and system dialogs that aren’t part of the webpage DOM
  • Visual coordinate targeting: AI agents can directly click on elements they see in screenshots

Beyond Traditional Web Automation

Core Capabilities

Mouse Operations

Control mouse interactions with pixel-level precision:

// Single click at coordinates
const response = await axios.post(`https://api.anchorbrowser.io/v1/sessions/${sessionId}/mouse/click`, {
  x: 400,
  y: 300,
  button: 'left' // 'left', 'middle', 'right'
}, {
  headers: {
    'anchor-api-key': 'your-api-key',
    'Content-Type': 'application/json'
  }
});

Drag and Drop

Perform complex drag and drop operations in a single command:

// Drag file from desktop to upload area
const response = await axios.post(`https://api.anchorbrowser.io/v1/sessions/${sessionId}/drag-and-drop`, {
  startX: 200,
  startY: 150,
  endX: 600,
  endY: 400,
  button: 'left'
}, {
  headers: {
    'anchor-api-key': 'your-api-key',
    'Content-Type': 'application/json'
  }
});

Keyboard Input

Send text and keyboard shortcuts with human-like timing:

// Type text with optional delay between keystrokes
const response = await axios.post(`https://api.anchorbrowser.io/v1/sessions/${sessionId}/keyboard/type`, {
  text: "Hello, world!",
  delay: 50 // milliseconds between keystrokes
}, {
  headers: {
    'anchor-api-key': 'your-api-key',
    'Content-Type': 'application/json'
  }
});

Scrolling

Control page scrolling with precision:

// Scroll at specific coordinates
const response = await axios.post(`https://api.anchorbrowser.io/v1/sessions/${sessionId}/scroll`, {
  x: 400,           // Where to perform scroll (cursor position)
  y: 300,           // Where to perform scroll (cursor position)
  deltaX: 0,        // Horizontal scroll amount (does not correlate with pixels)
  deltaY: 200,      // Vertical scroll amount (does not correlate with pixels, positive = down)
  steps: 5          // Number of steps for smooth scrolling
}, {
  headers: {
    'anchor-api-key': 'your-api-key',
    'Content-Type': 'application/json'
  }
});

Screenshots

Capture visual state for AI analysis:

// Take screenshot of current browser state
const response = await axios.get(`https://api.anchorbrowser.io/v1/sessions/${sessionId}/screenshot`, {
  headers: {
    'anchor-api-key': 'your-api-key'
  },
  responseType: 'arraybuffer'
});

const imageBuffer = response.data;
// Process screenshot with vision AI model

Clipboard Operations

Manage clipboard content programmatically:

// Get current clipboard content
const response = await axios.get(`https://api.anchorbrowser.io/v1/sessions/${sessionId}/clipboard`, {
  headers: {
    'anchor-api-key': 'your-api-key'
  }
});

const { data } = response.data;
console.log('Clipboard content:', data.text);

Direct URL navigation at the OS level on the currently selected tab:

// Navigate to a specific URL (completely OS-level, operates on selected tab)
const response = await axios.post(`https://api.anchorbrowser.io/v1/sessions/${sessionId}/goto`, {
  url: "https://example.com"
}, {
  headers: {
    'anchor-api-key': 'your-api-key',
    'Content-Type': 'application/json'
  }
});

AI Agent Integration Patterns

OpenAI Computer Use Integration

AnchorBrowser includes an integrated OpenAI Computer Use agent that leverages OS-level control for enhanced AI interactions. This agent can perform complex tasks by combining vision models with precise OS-level operations.

class AnchorBrowser(BasePlaywrightComputer):
    """
    Computer implementation for Anchor browser (https://anchorbrowser.io)
    Uses OS-level control endpoints for browser automation within the container.

    IMPORTANT: The `goto` and navigation tools are already implemented and recommended
    when using the Anchor computer to help the agent navigate more effectively.
    """

    def __init__(self, width: int = 1024, height: int = 900, session_id: str = None):
        """Initialize the Anchor browser session"""
        super().__init__()
        self.dimensions = (width, height)
        self.session_id = session_id
        self.base_url = "http://localhost:8484/api/os-control"

    def screenshot(self) -> str:
        """
        Capture a screenshot using OS-level control API.
        
        Returns:
            str: A base64 encoded string of the screenshot for AI model consumption.
        """
        try:
            response = requests.get(f"{self.base_url}/screenshot")
            
            if not response.ok:
                print(f"OS-level screenshot failed, falling back to standard screenshot")
                return super().screenshot()
                
            # OS-level API returns binary PNG data, encoded for AI models
            return base64.b64encode(response.content).decode('utf-8')
            
        except Exception as error:
            print(f"OS-level screenshot failed, falling back: {error}")
            return super().screenshot()

    def click(self, x: int, y: int, button: str = "left") -> None:
        """
        Click at the specified coordinates using OS-level control.
        
        Args:
            x: The x-coordinate to click.
            y: The y-coordinate to click.
            button: The mouse button to use ('left', 'right').
        """
        try:
            response = requests.post(
                f"{self.base_url}/mouse/click",
                json={"x": x, "y": y, "button": button}
            )
            
            if not response.ok:
                print(f"OS-level click failed, falling back to standard click")
                super().click(x, y, button)
                
        except Exception as error:
            print(f"OS-level click failed, falling back: {error}")
            super().click(x, y, button)

    def type(self, text: str) -> None:
        """
        Type text using OS-level control with realistic delays.
        """
        try:
            response = requests.post(
                f"{self.base_url}/keyboard/type",
                json={"text": text, "delay": 30}
            )
            
            if not response.ok:
                print(f"OS-level type failed, falling back to standard type")
                super().type(text)
                
        except Exception as error:
            print(f"OS-level type failed, falling back: {error}")
            super().type(text)

    def keypress(self, keys: List[str]) -> None:
        """
        Press keyboard shortcut using OS-level control.
        
        Args:
            keys: List of keys to press simultaneously (e.g., ['Ctrl', 'c']).
        """
        try:
            response = requests.post(
                f"{self.base_url}/keyboard/shortcut",
                json={"keys": keys, "holdTime": 100}
            )
            
            if not response.ok:
                print(f"OS-level keyboard shortcut failed, falling back")
                # Fallback to standard implementation
                for key in keys:
                    self._page.keyboard.down(key)
                for key in reversed(keys):
                    self._page.keyboard.up(key)
                
        except Exception as error:
            print(f"OS-level keyboard shortcut failed: {error}")

Usage with OpenAI Models

The integrated computer use agent works seamlessly with OpenAI’s vision models:

# Initialize the agent with your session
agent = AnchorBrowser(width=1440, height=900, session_id="your-session-id")

# The agent can now:
# 1. Take screenshots for AI analysis
screenshot = agent.screenshot()

# 2. Perform precise clicks based on AI vision
agent.click(x=400, y=300)

# 3. Type text naturally
agent.type("Hello from AI agent!")

# 4. Execute keyboard shortcuts
agent.keypress(['Ctrl', 'l'])  # Focus address bar
agent.keypress(['Ctrl', 'f'])  # Open search

The computer use integration provides automatic fallbacks to standard browser automation if OS-level operations aren’t available, ensuring reliability across different environments.

Limitations and Considerations

Session Requirements

  • Headful Sessions Only: OS-level control requires a visible desktop environment
  • Performance Impact: Screenshots and precise positioning may be slower than DOM-based automation

OS-level control opens up powerful possibilities for AI-driven browser automation, enabling more natural and effective interactions that mirror human behavior while providing the precision needed for reliable automation workflows.