Agent Vision gives AI agents eyes and hands on your Mac. Screenshot, discover UI elements, click, type, scroll. Any app, any window.
Works with Claude Code, Codex, Gemini CLI
Outputs the full command reference. Your AI agent now knows every command.
Just describe what you want in plain English. That's it.
Visual QA: "Use agent-vision to screenshot the app in dark mode and light mode on the simulator"
Form testing: "Use agent-vision to fill out the signup form and capture each step"
App navigation: "Use agent-vision to navigate to Settings > Privacy and check what's enabled"
Cross-browser: "Use agent-vision to QA test the login flow across Safari, Chrome, and Firefox"
AI orchestration: "Use agent-vision to control the other Claude session and build this feature"

Accessibility API + Vision OCR. Agent Vision uses macOS native APIs to discover every interactive element on screen. Buttons, text fields, links, labels.
No DOM. No browser. No Puppeteer. It reads what macOS already knows about every app's UI, then maps those elements to screen coordinates your AI agent can act on.
Works with any macOS application. Xcode, Figma, Mail, Terminal, your custom Electron app — if it's on screen, Agent Vision can see it.
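To make the "maps those elements to screen coordinates" step concrete, here is a minimal sketch of the arithmetic involved. It assumes, purely for illustration, that an element is described by a bounding box relative to the captured region; the helper name `element_center` and its argument layout are hypothetical and are not part of Agent Vision's documented interface.

```shell
# Hypothetical helper: turn an element's bounding box (relative to the
# captured region) into the absolute screen point at its center.
# Assumed inputs, none of them confirmed by Agent Vision itself:
#   $1 $2        region origin on screen (x, y)
#   $3 $4 $5 $6  element frame inside the region (x, y, width, height)
element_center() {
  local rx="$1" ry="$2" ex="$3" ey="$4" ew="$5" eh="$6"
  # Integer arithmetic: odd sizes round down by at most one pixel.
  echo "$(( rx + ex + ew / 2 )) $(( ry + ey + eh / 2 ))"
}
```

For example, `element_center 0 0 100 200 80 40` prints `140 220`: a button 80 wide and 40 tall whose top-left corner sits at (100, 200) inside a region starting at the screen origin.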
Select a screen region.
    agent-vision start --region 0,0,1440,900
Screenshot the region.
    agent-vision capture --session $SID
Discover every UI element.
    agent-vision elements --session $SID
Click, type, scroll.
    agent-vision click --element el-btn-001
Screen changed. Loop.
    agent-vision capture --session $SID

| Capability | Agent Vision | Puppeteer | AppleScript |
|---|---|---|---|
| Any macOS app | ✓ | — | ~ |
| Native UI element discovery | ✓ | — | ~ |
| Screenshot + OCR | ✓ | ✓ | — |
| Focus-free interaction | ✓ | — | — |
| Works without browser | ✓ | — | ✓ |
| Coordinate-accurate clicks | ✓ | ✓ | — |
| Session-based | ✓ | ✓ | — |
Puppeteer and Playwright only work in browsers. AppleScript works with some native apps but can't discover elements reliably or interact with coordinates. Agent Vision does both — for every app on your Mac.
Install the skill and Claude automatically knows every command, workflow, and best practice for agent-vision.
$ codex
> Use agent-vision to check the login form in Simulator
Any agent that can run shell commands can use Agent Vision. No special integration needed.
$ gemini
> Screenshot the Figma canvas and list all text elements
Works the same way. Start a session, capture, discover, act. The interface is the CLI.
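Since the interface is the CLI, the start, capture, discover, act sequence can be wrapped in a small shell function. The sketch below uses only the commands and flags shown on this page. How the session ID is obtained is not stated here, so the function assumes `start` prints it on stdout; treat that, and the wrapper name `av_pass`, as assumptions rather than documented behavior.

```shell
# Sketch of one start -> capture -> discover -> act pass.
# AV may be overridden to point at a different binary or a test stub.
AV="${AV:-agent-vision}"

av_pass() {
  local region="$1" element="$2" sid=""
  # ASSUMPTION: `start` prints the session ID on stdout.
  sid="$("$AV" start --region "$region")" || return 1
  "$AV" capture --session "$sid" || return 1    # screenshot the region
  "$AV" elements --session "$sid" || return 1   # list discovered UI elements
  "$AV" click --element "$element" || return 1  # act on one element
  "$AV" capture --session "$sid"                # rescan after the screen changed
}
```

A call such as `av_pass 0,0,1440,900 el-btn-001` would run the five steps in order and stop at the first command that fails.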
Let your AI agent test any native app. It finds buttons, fills forms, verifies states, and reports bugs — across apps that browser-based tools can't touch.
Discover input fields in any application, type values, tab between them, submit. No app-specific scripting required.
Capture a screenshot, analyze the visual state, decide what to do next. The scan-act-rescan loop gives AI agents a real-time visual feedback loop.
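One way to use the rescan half of that loop is to wait until the screen has settled before acting again. The sketch below rescans until two consecutive element listings are identical or a retry limit is reached. It assumes `elements` prints its findings on stdout; the function name `wait_until_stable`, the retry limit, and the `AV_POLL_SECONDS` variable are illustrative choices, not Agent Vision features.

```shell
# Sketch: rescan until two consecutive element listings match, or give up.
AV="${AV:-agent-vision}"

wait_until_stable() {
  local sid="$1" max_tries="${2:-5}" prev="" cur="" i=1
  while [ "$i" -le "$max_tries" ]; do
    # ASSUMPTION: `elements` prints the discovered elements on stdout.
    cur="$("$AV" elements --session "$sid")" || return 1
    if [ "$i" -gt 1 ] && [ "$cur" = "$prev" ]; then
      return 0   # screen settled; safe to act
    fi
    prev="$cur"
    i=$(( i + 1 ))
    sleep "${AV_POLL_SECONDS:-1}"
  done
  return 2       # still changing after max_tries scans
}
```

Returning distinct codes for "command failed" (1) and "never settled" (2) lets the calling agent decide whether to retry or report a problem.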
Copy data from a spreadsheet, paste into a web form, verify in a database tool. Agent Vision bridges the gaps between apps that were never designed to talk to each other.
One Claude controls another Claude through its terminal. AI agents orchestrating AI agents through visual interfaces. We built this site this way.
Point Agent Vision at a browser window and control web apps that have no API. Jira, Notion, Google Forms, legacy admin panels. No API keys needed.
Control the iOS Simulator without Appium or XCUITest. Tap, swipe, type, and verify, all through the macOS screen. Zero test framework setup.
Zapier for your actual desktop. Read email in Mail.app, create Calendar events, post to Slack, log in spreadsheets. No app needs integrations.

Agent Vision is a research preview. It is not safe for use in any production environment. It is meant as a productivity assistance tool. While some guardrails exist, it is not safe to run unsupervised. Use at your own risk.