Accessibility API + Vision OCR

Agent Vision uses macOS native APIs to discover every interactive element on screen: buttons, text fields, links, labels.

No DOM. No browser. No Puppeteer. It reads what macOS already knows about every app's UI, then maps those elements to screen coordinates your AI agent can act on.

Works with any macOS application. Xcode, Figma, Mail, Terminal, your custom Electron app — if it's on screen, Agent Vision can see it.

agent-vision elements

button el-btn-001 "Submit Form"

textfield el-input-002 "Enter email"

link el-link-003 "Documentation"

heading el-heading-004 "Account Settings"

image el-img-005 "User avatar"
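
An agent usually needs one element id out of a listing like this. The sketch below looks up an id by role and label. It assumes the output is exactly one element per line in the form role, id, quoted label, as in the sample above; the real output may differ. find_element_id is a hypothetical helper, not part of agent-vision.

```shell
# Hypothetical helper: print the id of the first element whose role and
# label match. ASSUMPTION: each input line looks like
#   role id "label"
# as in the sample listing; adjust the parsing if the real output differs.
find_element_id() {
    role="$1"
    label="$2"
    awk -v role="$role" -v label="$label" '
        $1 == role {
            # The label is the text between the first and last double quote.
            first = index($0, "\"")
            if (first == 0) next
            rest = substr($0, first + 1)
            sub(/"[^"]*$/, "", rest)
            if (rest == label) { print $2; exit }
        }
    '
}

# Usage, reading the listing from stdin:
#   agent-vision elements --session "$SID" | find_element_id button "Submit Form"
```

The printed id can then be passed to agent-vision click --element.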

The loop

Scan. Act. Re-scan.

start

Select a screen region.

agent-vision start --region 0,0,1440,900

capture

Screenshot the region.

agent-vision capture --session $SID

elements

Discover every UI element.

agent-vision elements --session $SID

control

Click, type, scroll.

agent-vision click --element el-btn-001

re-scan

Screen changed. Loop.

agent-vision capture --session $SID
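
The five steps above can be strung together in one shell function. This is a minimal sketch that uses only the commands shown on this page. It assumes that agent-vision start prints the session id on stdout, which the page does not confirm. scan_act_rescan and the AV variable are hypothetical names introduced for illustration.

```shell
# Command to run; override AV to point at a stub when experimenting.
AV="${AV:-agent-vision}"

# One pass of the loop: start, capture, elements, click, re-capture.
scan_act_rescan() {
    region="$1"     # e.g. 0,0,1440,900
    element="$2"    # e.g. el-btn-001

    # ASSUMPTION: start writes the session id to stdout.
    SID=$("$AV" start --region "$region") || return 1

    "$AV" capture --session "$SID" || return 1     # screenshot the region
    "$AV" elements --session "$SID" || return 1    # list UI elements
    "$AV" click --element "$element" || return 1   # act on one element
    "$AV" capture --session "$SID"                 # screen changed, re-scan
}

# Usage:
#   scan_act_rescan 0,0,1440,900 el-btn-001
```

In practice the agent would pick the element from the elements output before acting, and repeat the capture and elements steps after every action.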

Comparison

Why not Puppeteer?

Capability	Agent Vision	Puppeteer	AppleScript
Any macOS app	✓	—	~
Native UI element discovery	✓	—	~
Screenshot + OCR	✓	✓	—
Focus-free interaction	✓	—	—
Works without browser	✓	—	✓
Coordinate-accurate clicks	✓	✓	—
Session-based	✓	✓	—

Puppeteer and Playwright only work in browsers. AppleScript works with some native apps but can't discover elements reliably or click at screen coordinates. Agent Vision does both, for every app on your Mac.

Integrations

Works with your agent

Claude Code

terminal

# add the skill marketplace

$ claude plugin marketplace add rvanbaalen/skills

# install the skill

$ claude plugin install use-agentvision

Install the skill and Claude automatically knows every command, workflow, and best practice for agent-vision.

Codex

codex session

$ codex
> Use agent-vision to check
  the login form in Simulator

Any agent that can run shell commands can use Agent Vision. No special integration needed.

Gemini CLI

gemini session

$ gemini
> Screenshot the Figma canvas
  and list all text elements

Works the same way. Start a session, capture, discover, act. The interface is the CLI.

Deep dives

Explore use cases

QA Testing with AI Agents

Let your AI agent test any native app. It finds buttons, fills forms, verifies states, and reports bugs — across apps that browser-based tools can't touch.

Form Automation with AI Agents

Discover input fields in any application, type values, tab between them, submit. No app-specific scripting required.

Visual Feedback Loops for AI Agents

Capture a screenshot, analyze the visual state, decide what to do next. The scan-act-rescan cycle gives AI agents a real-time visual feedback loop.

Multi-App Workflows with AI Agents

Copy data from a spreadsheet, paste into a web form, verify in a database tool. Agent Vision bridges the gaps between apps that were never designed to talk to each other.

Control Claude with Claude

One Claude controls another Claude through its terminal. AI agents orchestrating AI agents through visual interfaces. We built this site this way.

Automate Any Web App Without an API

Point Agent Vision at a browser window and control web apps that have no API. Jira, Notion, Google Forms, legacy admin panels. No API keys needed.

Test iOS Apps in the Simulator

Control the iOS Simulator without Appium or XCUITest. Tap, swipe, type, and verify — all through the macOS screen. Zero test framework setup.

Chain Desktop Apps into AI Workflows

Zapier for your actual desktop. Read email in Mail.app, create Calendar events, post to Slack, log results in a spreadsheet. None of the apps needs an integration.

Research Preview

Agent Vision is a research preview. It is not safe for use in any production environment. It is meant as a productivity aid. Some guardrails exist, but it is not safe to run unsupervised. Use at your own risk.