The article introduces Opendesk, an open-source tool that enhances the reliability of computer-use agents by leveraging native accessibility APIs to identify interactive elements, replacing error-prone pixel-coordinate guessing.
Opendesk: Give any AI agent eyes + hands on your desktop.

I was experimenting with computer-use capabilities from different models, but I wanted to keep using Claude Code and my own agentic harness to automate real desktop tasks, with improved accuracy from my custom algorithm.

Now you can let an agent control your entire desktop, mouse and keyboard included, to perform real workflows and interact with apps and websites more accurately.

Examples:
• “Open Spotify and play a lofi playlist”
• “Go to Twitter and like the first 3 posts on my feed”
• “Fill out this form on Chrome”

You can also use opendesk for the following:

1. Learn & Replay
The agent can watch what you do on your screen and replay the whole task later.
Example: Record yourself logging into a dashboard and exporting a report, and it can repeat that anytime on command.

2. Scheduling
Run computer-use tasks automatically at a specific time.
Example: Every morning at 9am, open Gmail and summarize unread emails.

Reason to build:
Most computer-use demos work by feeding a screenshot to an LLM and asking it to output pixel coordinates. This works surprisingly often but fails in predictable ways: Retina scaling, window repositioning, UI density, and any layout change break it.

The approach I've been exploring in opendesk is: query the platform's native accessibility API first (AppleScript on macOS, AT-SPI2 on Linux, UI Automation on Windows), get the actual interactive elements with their labels and bounding boxes, then draw numbered chips on those elements before the screenshot ever reaches the LLM.

The model never guesses coordinates. It reasons about what to do and references elements by their mark number. The system already knows exactly where mark 7 is. Mouse coordinates become a fallback for elements with no accessible label, such as canvas areas, video players, and games.

Another idea in the same vein: when replaying a recorded workflow, don’t replay coordinates. Store the trajectory as a sequence of events and screenshots, and at replay time feed that as context to the LLM, which re-executes it against the current screen state. This makes replay adaptive rather than brittle.

Waiting for feedback from the community! 😃

GitHub: [https://github.com/vitalops/opendesk](https://github.com/vitalops/opendesk)
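
To make the mark-based targeting described above a bit more concrete, here is a minimal Python sketch of the idea. It is not opendesk's actual code: the `Element` type, `assign_marks`, and `resolve_click` are hypothetical names, and the elements are assumed to have already been fetched from the platform's accessibility API as a label plus a bounding box. The sketch numbers the elements, resolves a mark chosen by the model to a click point, and only uses raw coordinates as a fallback.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Element:
    """Hypothetical accessible element: a label plus a bounding box in screen points."""
    label: str
    x: float
    y: float
    width: float
    height: float

    def center(self) -> Point:
        # Clicking the center of the bounding box avoids edge misses.
        return (self.x + self.width / 2, self.y + self.height / 2)


def assign_marks(elements: Iterable[Element]) -> Dict[int, Element]:
    """Number elements 1..N, top-to-bottom then left-to-right (an assumed ordering)."""
    ordered = sorted(elements, key=lambda e: (e.y, e.x))
    return {number: element for number, element in enumerate(ordered, start=1)}


def resolve_click(
    marks: Dict[int, Element],
    mark: Optional[int] = None,
    fallback_xy: Optional[Point] = None,
) -> Point:
    """Turn the model's answer into a click point.

    The model normally returns a mark number; raw coordinates are only
    accepted as a fallback for targets with no accessible element.
    """
    if mark is not None:
        if mark not in marks:
            raise KeyError(f"unknown mark {mark}")
        return marks[mark].center()
    if fallback_xy is not None:
        return fallback_xy
    raise ValueError("need either a mark number or fallback coordinates")


if __name__ == "__main__":
    found = [
        Element("Play", x=100, y=200, width=40, height=20),
        Element("Search", x=10, y=10, width=200, height=30),
    ]
    marks = assign_marks(found)
    for number, element in marks.items():
        print(number, element.label, element.center())
    # The model answers "click mark 2"; the harness looks up the position.
    print(resolve_click(marks, mark=2))
```

Because the harness owns the mapping from mark number to position, the model's output stays valid across display scaling or window moves as long as the marks are regenerated from a fresh accessibility query before each step.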
A developer shares four common accessibility tree pitfalls that break desktop agents: stale PIDs after app switches, modal sheets intercepting clicks, multi-monitor coordinate issues, and silent failures. Solutions include detecting frontmost app changes, explicit modal checks, and correct coordinate targeting.
This article explains in detail how to use the Accessibility API, CGEvent.postToPid, and event taps on macOS so that an AI agent can operate windows in the background without disturbing the user, effectively letting two mouse pointers coexist.
A preprint analyzing why computer-use agents succeed once but fail on repeated executions, attributing unreliability to execution stochasticity, task ambiguity, and behavioral variability, and advocating repeated evaluation and stable strategies.
OpenAI demonstrates the 'Computer Use' feature in Codex, allowing the AI to directly interact with local GUI applications on macOS using an accessibility framework and the fast Spark model for non-blocking, high-speed automation.
A developer reverse-engineered OpenAI's Codex Computer Use to build pi-computer-use, an open-source, model-agnostic macOS automation tool featuring ax-first navigation and vision fallback for supported models.