Your AI agent can't see.
Fix that.

Agent Vision gives AI agents eyes and hands on your Mac. Screenshot, discover UI elements, click, type, scroll — any app, any window.

$ brew install agent-vision

macOS only · lightweight · works with Claude Code, Codex, Gemini CLI

// How it sees

Accessibility API + Vision OCR

Agent Vision uses macOS native APIs to discover every interactive element on screen. Buttons, text fields, links, labels: anything the OS knows about.

No DOM. No browser. No Puppeteer. It reads what macOS already knows about every app's UI, then maps those elements to screen coordinates your AI agent can act on.

Works with any macOS application. Xcode, Figma, Mail, Terminal, your custom Electron app — if it's on screen, Agent Vision can see it.

agent-vision elements
button el-btn-001 "Submit Form"
textfield el-input-002 "Enter email"
link el-link-003 "Documentation"
heading el-heading-004 "Account Settings"
image el-img-005 "User avatar"
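
If the listing really is one element per line in the form role, ID, quoted label, as in the panel above (an assumption; the real output may differ), a script can pick out IDs with standard text tools. A minimal sketch: `sample_listing`, `ids_for_role` and `label_for_id` are hypothetical helpers, not part of Agent Vision, and the sample data is copied from the panel.

```shell
#!/bin/sh
# Sample listing copied from the panel above. In real use this would be
# the output of `agent-vision elements`; the line format is an assumption.
sample_listing() {
  cat <<'EOF'
button el-btn-001 "Submit Form"
textfield el-input-002 "Enter email"
link el-link-003 "Documentation"
heading el-heading-004 "Account Settings"
image el-img-005 "User avatar"
EOF
}

# ids_for_role ROLE: read a listing on stdin and print the ID (second
# field) of every line whose first field equals ROLE.
ids_for_role() {
  awk -v role="$1" '$1 == role { print $2 }'
}

# label_for_id ID: print the label of the element with that ID,
# with the surrounding double quotes removed.
label_for_id() {
  awk -v id="$1" '$2 == id { sub(/^[^"]*"/, ""); sub(/"$/, ""); print }'
}

sample_listing | ids_for_role button       # prints el-btn-001
sample_listing | label_for_id el-link-003  # prints Documentation
```

The helpers only read stdin, so they can be piped after a real listing once its format is confirmed.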

// The loop

agent-vision workflow
01 start
Select a screen region. Agent Vision locks onto it.
agent-vision start --region 0,0,1440,900

02 capture
Screenshot the region. Get a PNG your agent can see.
agent-vision capture --session $SID

03 elements
Discover every UI element. Buttons, links, inputs, labels.
agent-vision elements --session $SID

04 control
Click, type, scroll. Act on what you found.
agent-vision click --element el-btn-001

05 re-scan
The screen changed. Capture again. Loop.
agent-vision capture --session $SID
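
The five steps can be strung together in a small script. This is a sketch under one loud assumption: that `start` prints the new session ID on stdout. The text above does not say where `$SID` comes from. `run_loop` and the `AV` variable are hypothetical names used here for illustration.

```shell
#!/bin/sh
# AV is the command to run; override it to test against a stub.
AV="${AV:-agent-vision}"

# run_loop REGION ELEMENT_ID ROUNDS
# Assumption (not confirmed by the page): `start` prints the session ID
# on stdout. Adjust the SID line if it is reported some other way.
run_loop() {
  region=$1 element=$2 rounds=$3
  SID=$("$AV" start --region "$region") || return 1    # 01: start
  i=0
  while [ "$i" -lt "$rounds" ]; do
    "$AV" capture --session "$SID"                      # 02: screenshot
    "$AV" elements --session "$SID"                     # 03: discover
    "$AV" click --element "$element" --session "$SID"   # 04: act
    i=$((i + 1))                                        # 05: go around again
  done
}
```

A call such as `run_loop 0,0,1440,900 el-btn-001 3` would run three scan-and-click rounds. A real agent would choose the element from each fresh listing instead of clicking a fixed ID.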

// Use cases

QA Testing with AI Agents

Let your AI agent test any native app. It finds buttons, fills forms, verifies states, and reports bugs — across apps that browser-based tools can't touch.

agent-vision elements --session $SID --filter button

Form Automation with AI Agents

Discover input fields in any application, type values, tab between them, submit. No app-specific scripting required.

agent-vision type --element el-input-002 --text "hello@example.com"
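
Filling several fields is the same command repeated. A sketch, assuming the element IDs are already known from an earlier `elements` listing; `fill_form` is a hypothetical helper and the tab-separated input format is an invention for this example.

```shell
#!/bin/sh
# AV is the command to run; override it to test against a stub.
AV="${AV:-agent-vision}"

# fill_form: read "element-id<TAB>value" pairs on stdin and type each
# value into its field with the `type --element --text` form shown above.
fill_form() {
  while IFS="$(printf '\t')" read -r id value; do
    [ -n "$id" ] || continue   # skip blank lines
    # </dev/null keeps the command from swallowing the remaining pairs
    "$AV" type --element "$id" --text "$value" </dev/null || return 1
  done
}
```

For example, `printf 'el-input-002\thello@example.com\n' | fill_form` types the address into the email field from the listing above.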

Visual Feedback Loops for AI Agents

Capture a screenshot, analyze the visual state, decide what to do next. Scan, act, rescan: the cycle gives AI agents visual feedback in real time.

agent-vision capture --session $SID --format png

Multi-App Workflows with AI Agents

Copy data from a spreadsheet, paste into a web form, verify in a database tool. Agent Vision bridges the gaps between apps that were never designed to talk to each other.

agent-vision click --element el-btn-003 --session $SID

Control Claude with Claude

One Claude controls another Claude through its terminal. AI agents orchestrating AI agents through visual interfaces. We built this site this way.

agent-vision type --text "fix the header" --session $INNER_SID

Automate Any Web App Without an API

Point Agent Vision at a browser window and control web apps that have no API. Jira, Notion, Google Forms, legacy admin panels. No API keys needed.

agent-vision click --element el-link-005 --session $SID

Test iOS Apps in the Simulator

Control the iOS Simulator without Appium or XCUITest. Tap, swipe, type, and verify — all through the macOS screen. Zero test framework setup.

agent-vision drag --from 215,700 --to 215,300 --session $SID
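
A swipe is a drag between two points on the same vertical line. The sketch below computes the `--from` and `--to` arguments for an upward swipe, assuming y grows downward as the example above suggests (700 down to 300). `swipe_up_args` is a hypothetical helper, and whether the coordinates are relative to the screen or to the session region is not stated here.

```shell
#!/bin/sh
# swipe_up_args X Y_START DISTANCE
# Print drag arguments for a vertical swipe starting at (X, Y_START) and
# moving DISTANCE points toward smaller y. Assumes y grows downward.
swipe_up_args() {
  x=$1 y_from=$2 dist=$3
  y_to=$((y_from - dist))
  # Refuse swipes that would end at a negative coordinate.
  [ "$y_to" -ge 0 ] || { echo "swipe ends above y=0" >&2; return 1; }
  printf '%s\n' "--from $x,$y_from --to $x,$y_to"
}

swipe_up_args 215 700 400   # prints: --from 215,700 --to 215,300
```

The output can be spliced into the command shown above, for example `agent-vision drag $(swipe_up_args 215 700 400) --session $SID`, leaving the substitution unquoted so it splits into separate arguments.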

Chain Desktop Apps into AI Workflows

Zapier for your actual desktop. Read email in Mail.app, create Calendar events, post to Slack, log results to a spreadsheet. None of the apps needs an integration.

agent-vision start --region 0,0,800,600 --name mail

// Why not Puppeteer?

comparison
Capability Agent Vision Puppeteer AppleScript
Any macOS app ~
Native UI element discovery ~
Screenshot + OCR
Focus-free interaction
Works without browser
Coordinate-accurate clicks
Session-based

Puppeteer and Playwright only work in browsers. AppleScript works with some native apps but can't reliably discover elements or act on screen coordinates. Agent Vision does both, for every app on your Mac.

// Works with

Claude Code

~/.claude/settings.json
{
  "tools": ["agent-vision"]
}

Claude Code can call agent-vision directly as a shell command. Give it eyes on your simulator, IDE, or any app.

Codex

codex session
$ codex
> Use agent-vision to check
  the login form in Simulator

Any agent that can run shell commands can use Agent Vision. No special integration needed.

Gemini CLI

gemini session
$ gemini
> Screenshot the Figma canvas
  and list all text elements

Works the same way. Start a session, capture, discover, act. The interface is the CLI.

Get started

terminal
$ brew install agent-vision

Requires macOS 13+ · No dependencies · ~4MB
