macOS native

Your AI agent can't see. Fix that.

Agent Vision gives AI agents eyes and hands on your Mac. Screenshot, discover UI elements, click, type, scroll. Any app, any window.

$ brew install rvanbaalen/tap/agent-vision

Works with Claude Code, Codex, Gemini CLI

Up and running in 30 seconds

1. Install

terminal
$ brew install rvanbaalen/tap/agent-vision

2. Teach your agent

terminal
$ agent-vision learn

Outputs the full command reference. Your AI agent now knows every command.
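One way to keep that reference around is to save it to a file the agent can reread. This is a rough sketch, assuming `agent-vision learn` writes the reference to standard output; `save_reference` and the file name are made up for this example and are not part of agent-vision.

```shell
# Hypothetical helper: save the command reference to a file.
# Assumes `agent-vision learn` prints the reference to standard output.
save_reference() {
  out_file="${1:-agent-vision-reference.txt}"
  # Bail out early if the CLI is not installed.
  if ! command -v agent-vision >/dev/null 2>&1; then
    echo "agent-vision not found on PATH; install it first" >&2
    return 1
  fi
  agent-vision learn > "$out_file"
}

if save_reference; then
  echo "reference saved"
else
  echo "reference not saved"
fi
```

Point your agent at the saved file, or just let it run `agent-vision learn` itself.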

3. Give it a task

"Use agent-vision to capture all states of the iOS Simulator, dark mode and light mode"

Just describe what you want in plain English. That's it.

Example prompts

What you can tell your agent

"Use agent-vision to screenshot the app in dark mode and light mode on the simulator"

Visual QA

"Use agent-vision to fill out the signup form and capture each step"

Form testing

"Use agent-vision to navigate to Settings > Privacy and check what's enabled"

App navigation

"Use agent-vision to QA test the login flow across Safari, Chrome, and Firefox"

Cross-browser

"Use agent-vision to control the other Claude session and build this feature"

AI orchestration

How it sees

Native macOS APIs. Not browser hacks.

Accessibility API + Vision OCR. Agent Vision uses macOS native APIs to discover every interactive element on screen. Buttons, text fields, links, labels.

No DOM. No browser. No Puppeteer. It reads what macOS already knows about every app's UI, then maps those elements to screen coordinates your AI agent can act on.

Works with any macOS application. Xcode, Figma, Mail, Terminal, your custom Electron app — if it's on screen, Agent Vision can see it.

agent-vision elements
button el-btn-001 "Submit Form"
textfield el-input-002 "Enter email"
link el-link-003 "Documentation"
heading el-heading-004 "Account Settings"
image el-img-005 "User avatar"
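Each line in that listing carries a type, an ID, and a quoted label, so a script can look up an ID by label before acting on it. This is a minimal sketch, assuming the real output matches the sample above line for line; `find_element_id` is a hypothetical helper, not an agent-vision command.

```shell
# Hypothetical helper: print the ID of the first element whose quoted
# label matches exactly. Assumes each line looks like: <type> <id> "<label>"
find_element_id() {
  label="$1"
  awk -v want="$label" '
    {
      # The label is everything between the first and last double quote.
      start = index($0, "\"")
      if (start == 0) next
      text = substr($0, start + 1)
      sub(/"[^"]*$/, "", text)
      if (text == want) { print $2; exit }
    }
  '
}

# Sample lines copied from the listing above. In practice, pipe in the
# output of `agent-vision elements` instead.
sample='button el-btn-001 "Submit Form"
textfield el-input-002 "Enter email"
link el-link-003 "Documentation"'

printf '%s\n' "$sample" | find_element_id "Enter email"
```

With the sample data this prints `el-input-002`, which is the kind of ID a click command expects.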

Scan. Act. Re-scan.

1. start

Select a screen region.

agent-vision start --region 0,0,1440,900

2. capture

Screenshot the region.

agent-vision capture --session $SID

3. elements

Discover every UI element.

agent-vision elements --session $SID

4. control

Click, type, scroll.

agent-vision click --element el-btn-001

5. re-scan

Screen changed. Loop.

agent-vision capture --session $SID
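The five steps can be strung together in one script. The sketch below is a dry run: it prints each command instead of executing it, using only the commands and flags shown above. The `AV` switch and the `scan_act_rescan` function are made up for this example, and `SID` is a placeholder, since this page does not show how `start` reports the session ID.

```shell
# Dry run by default: commands are printed, not executed.
# Set AV=agent-vision to run them for real (sh or bash).
AV="${AV:-echo agent-vision}"

# Placeholder session ID. How `start` reports the real one is not
# shown on this page, so this value is an assumption.
SID="${SID:-SESSION_ID}"

scan_act_rescan() {
  element_id="$1"
  $AV start --region 0,0,1440,900      # 1. select a screen region
  $AV capture --session "$SID"         # 2. screenshot the region
  $AV elements --session "$SID"        # 3. discover UI elements
  $AV click --element "$element_id"    # 4. act on one element
  $AV capture --session "$SID"         # 5. re-scan after the screen changes
}

scan_act_rescan el-btn-001
```

In practice your agent decides which element to act on after step 3 and repeats steps 2 through 5 until the task is done.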
Comparison

Why not Puppeteer?

comparison
Capability Agent Vision Puppeteer AppleScript
Any macOS app ~
Native UI element discovery ~
Screenshot + OCR
Focus-free interaction
Works without browser
Coordinate-accurate clicks
Session-based

Puppeteer and Playwright only work in browsers. AppleScript works with some native apps but can't reliably discover elements or click by screen coordinates. Agent Vision does both, for every app on your Mac.

Works with your agent

Claude Code

terminal
# add the skill marketplace
$ claude plugin marketplace add rvanbaalen/skills
# install the skill
$ claude plugin install use-agentvision

Install the skill and Claude automatically knows every command, workflow, and best practice for agent-vision.

Codex

codex session
$ codex
> Use agent-vision to check
  the login form in Simulator

Any agent that can run shell commands can use Agent Vision. No special integration needed.

Gemini CLI

gemini session
$ gemini
> Screenshot the Figma canvas
  and list all text elements

Works the same way. Start a session, capture, discover, act. The interface is the CLI.

Deep dives

Explore use cases

QA Testing with AI Agents

Let your AI agent test any native app. It finds buttons, fills forms, verifies states, and reports bugs — across apps that browser-based tools can't touch.


Form Automation with AI Agents

Discover input fields in any application, type values, tab between them, submit. No app-specific scripting required.


Visual Feedback Loops for AI Agents

Capture a screenshot, analyze the visual state, decide what to do next. The scan-act-rescan cycle gives AI agents real-time visual feedback.


Multi-App Workflows with AI Agents

Copy data from a spreadsheet, paste into a web form, verify in a database tool. Agent Vision bridges the gaps between apps that were never designed to talk to each other.


Control Claude with Claude

One Claude controls another Claude through its terminal. AI agents orchestrating AI agents through visual interfaces. We built this site this way.


Automate Any Web App Without an API

Point Agent Vision at a browser window and control web apps that have no API. Jira, Notion, Google Forms, legacy admin panels. No API keys needed.


Test iOS Apps in the Simulator

Control the iOS Simulator without Appium or XCUITest. Tap, swipe, type, and verify — all through the macOS screen. Zero test framework setup.


Chain Desktop Apps into AI Workflows

Zapier for your actual desktop. Read email in Mail.app, create Calendar events, post to Slack, log results in a spreadsheet. No app needs its own integration.


Research Preview

Agent Vision is a research preview. It is not safe for use in any production environment. It is meant as a productivity assistance tool. While some guardrails exist, it is not safe to run unsupervised. Use at your own risk.