Agent Vision gives AI agents eyes and hands on your Mac. Screenshot, discover UI elements, click, type, scroll — any app, any window.
macOS only · lightweight · works with Claude Code, Codex, Gemini CLI
Accessibility API + Vision OCR. Agent Vision uses macOS native APIs to discover every interactive element on screen. Buttons, text fields, links, labels — anything the OS knows about.
No DOM. No browser. No Puppeteer. It reads what macOS already knows about every app's UI, then maps those elements to screen coordinates your AI agent can act on.
Works with any macOS application. Xcode, Figma, Mail, Terminal, your custom Electron app — if it's on screen, Agent Vision can see it.
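As a concrete sketch of that flow — hedged, since the flag names mirror the examples below but the session-id capture, the element id, and the command output format are all assumptions not documented here — a discovery pass looks roughly like:

```shell
# Sketch of a discovery pass. Assumes `agent-vision` is on PATH and that
# `start` prints a session id; the element id is illustrative.
if command -v agent-vision >/dev/null 2>&1; then
  SID=$(agent-vision start --name demo)
  # List every interactive element macOS exposes for the captured window.
  agent-vision elements --session "$SID" --filter button
  # Act on an element id; Agent Vision maps it to screen coordinates.
  agent-vision click --element el-btn-003 --session "$SID"
else
  echo "agent-vision not installed; skipping live demo"
fi
```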
Let your AI agent test any native app. It finds buttons, fills forms, verifies states, and reports bugs — across apps that browser-based tools can't touch.
```
agent-vision elements --session $SID --filter button
```

Discover input fields in any application, type values, tab between them, and submit. No app-specific scripting required.

```
agent-vision type --element el-input-002 --text "hello@example.com"
```

Capture a screenshot, analyze the visual state, decide what to do next. The scan-act-rescan loop gives AI agents real-time visual feedback.

```
agent-vision capture --session $SID --format png
```

Copy data from a spreadsheet, paste it into a web form, verify it in a database tool. Agent Vision bridges the gaps between apps that were never designed to talk to each other.

```
agent-vision click --element el-btn-003 --session $SID
```

One Claude controls another Claude through its terminal. AI agents orchestrating AI agents through visual interfaces. We built this site this way.

```
agent-vision type --text "fix the header" --session $INNER_SID
```

Point Agent Vision at a browser window and control web apps that have no API. Jira, Notion, Google Forms, legacy admin panels. No API keys needed.

```
agent-vision click --element el-link-005 --session $SID
```

Control the iOS Simulator without Appium or XCUITest. Tap, swipe, type, and verify — all through the macOS screen. Zero test-framework setup.

```
agent-vision drag --from 215,700 --to 215,300 --session $SID
```

Zapier for your actual desktop. Read email in Mail.app, create Calendar events, post to Slack, log to spreadsheets. No app needs integrations.

```
agent-vision start --region 0,0,800,600 --name mail
```

| Capability | Agent Vision | Puppeteer | AppleScript |
|---|---|---|---|
| Any macOS app | ✓ | — | ~ |
| Native UI element discovery | ✓ | — | ~ |
| Screenshot + OCR | ✓ | ✓ | — |
| Focus-free interaction | ✓ | — | — |
| Works without browser | ✓ | — | ✓ |
| Coordinate-accurate clicks | ✓ | ✓ | — |
| Session-based | ✓ | ✓ | — |
Puppeteer and Playwright only work in browsers. AppleScript works with some native apps but can't reliably discover UI elements or click at screen coordinates. Agent Vision does both — for every app on your Mac.
```json
{ "tools": ["agent-vision"] }
```
Claude Code can call agent-vision directly as a shell command. Give it eyes on your simulator, IDE, or any app.
```
$ codex
> Use agent-vision to check the login form in Simulator
```
Any agent that can run shell commands can use Agent Vision. No special integration needed.
```
$ gemini
> Screenshot the Figma canvas and list all text elements
```
Works the same way. Start a session, capture, discover, act. The interface is the CLI.
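Putting those steps together, a minimal scan-act-rescan session might look like the sketch below. It is guarded so it degrades gracefully when `agent-vision` isn't installed; the session-id capture and the element id are assumptions, not documented behavior.

```shell
# Minimal scan-act-rescan loop. Flags mirror the examples above;
# `start` printing a session id and the element id are assumptions.
if command -v agent-vision >/dev/null 2>&1; then
  SID=$(agent-vision start --name demo)
  agent-vision capture --session "$SID" --format png       # scan: what's on screen?
  agent-vision elements --session "$SID" --filter button   # discover actionable elements
  agent-vision click --element el-btn-003 --session "$SID" # act
  agent-vision capture --session "$SID" --format png       # rescan: did the click work?
else
  echo "agent-vision not installed; skipping live demo"
fi
```

Each capture hands the agent fresh visual state to reason over before its next action.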
Requires macOS 13+ · No dependencies · ~4MB