Accessibility API and Set-of-Marks: making computer-use agents more reliable

Reddit r/ArtificialInteligence 05/11/26, 05:02 AM Tools

computer-use ai-agents open-source accessibility-api automation desktop-automation

Summary

The article introduces Opendesk, an open-source tool that enhances the reliability of computer-use agents by leveraging native accessibility APIs to identify interactive elements, replacing error-prone pixel-coordinate guessing.

Opendesk: Give any AI agent eyes + hands on your desktop.

I was experimenting with the computer-use capabilities of different models, but I wanted to keep using Claude Code and my own agentic harness to automate real desktop tasks, with better accuracy from my custom algorithm. Now you can let an agent control your entire desktop, mouse and keyboard included, to perform real workflows and interact with apps and websites more accurately.

Examples:
• “Open Spotify and play a lofi playlist”
• “Go to Twitter and like the first 3 posts on my feed”
• “Fill out this form on Chrome”

You can also use opendesk for the following:

1. Learn & Replay: The agent can watch what you do on your screen and replay the whole task later. Example: record yourself logging into a dashboard and exporting a report, and it can repeat that anytime on command.

2. Scheduling: Run computer-use tasks automatically at a specific time. Example: every morning at 9am, open Gmail and summarize unread emails.

Reason to build: Most computer-use demos work by feeding a screenshot to an LLM and asking it to output pixel coordinates. This works surprisingly often but fails in predictable ways: Retina scaling, window repositioning, UI density, and any layout change can break it.

The approach I've been exploring in opendesk is: query the platform's native accessibility API first (AppleScript on macOS, AT-SPI2 on Linux, UI Automation on Windows), get the actual interactive elements with their labels and bounding boxes, then draw numbered chips on those elements before the screenshot ever reaches the LLM. The model never guesses coordinates. It reasons about what to do and references elements by their mark number, and the system already knows exactly where mark 7 is. Mouse coordinates become a fallback for elements with no accessible label: canvas areas, video players, games.

Another idea in the same vein: when replaying a recorded workflow, don’t replay coordinates. Store the trajectory as a sequence of events and screenshots, and at replay time feed that as context to the LLM, which re-executes it against the current screen state. This makes replay adaptive rather than brittle.

Waiting for feedback from the community! 😃

Github: [https://github.com/vitalops/opendesk](https://github.com/vitalops/opendesk)
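
To make the mark idea concrete, here is a minimal Python sketch of the bookkeeping described above. It is not opendesk's actual code: the `Element` type and the `assign_marks` and `resolve_click` helpers are hypothetical names made up for illustration, and the elements are hard-coded where a real system would get them from the accessibility API. Labeled elements get a mark number, a mark resolves to the center of its bounding box, and raw coordinates are accepted only as a fallback.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Element:
    """Hypothetical record for one element reported by an accessibility API."""
    label: str      # accessible label; empty if the element has none
    role: str       # e.g. "button", "text field"
    x: float        # bounding box: left edge
    y: float        # bounding box: top edge
    width: float
    height: float


def assign_marks(elements: Iterable[Element]) -> Dict[int, Element]:
    """Number labeled elements from 1, top-to-bottom then left-to-right.

    Elements without an accessible label are skipped; those are the cases
    left to the coordinate fallback.
    """
    labeled = [e for e in elements if e.label]
    ordered = sorted(labeled, key=lambda e: (e.y, e.x))
    return {number: e for number, e in enumerate(ordered, start=1)}


def resolve_click(
    marks: Dict[int, Element],
    mark: Optional[int] = None,
    fallback_xy: Optional[Tuple[float, float]] = None,
) -> Tuple[float, float]:
    """Turn the model's answer into a click point.

    A mark number wins when given: the click lands at the center of that
    element's bounding box. Raw coordinates are used only when no mark
    was provided.
    """
    if mark is not None:
        if mark not in marks:
            raise KeyError(f"unknown mark {mark}")
        e = marks[mark]
        return (e.x + e.width / 2, e.y + e.height / 2)
    if fallback_xy is not None:
        return fallback_xy
    raise ValueError("need either a mark number or fallback coordinates")
```

With a marks table like this, the model only has to answer "click mark 2", and the lookup supplies the position. The sketch leaves out drawing the numbered chips onto the screenshot.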

Similar Articles

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

@dair_ai: Outstanding paper on computer-using agents. (bookmark it) Computer-using agents drive real software through the screen,…

the accessibility tree gotchas that kept breaking my desktop agent

@EEEEYHN: https://x.com/EEEEYHN/status/2057397813999456759

On the Reliability of Computer Use Agents
