MIRAGE is a framework for mobile GUI agents that replaces verbose chain-of-thought reasoning with compact continuous latent representations and uses a generative world model to predict future screen states before acting. On the AndroidWorld and AndroidControl benchmarks, it achieves competitive or superior performance while reducing generated tokens by over 75%.
This paper introduces PRO-CUA, a process-reward optimization framework for training Computer Use Agents (CUAs) using iterative step-level reinforcement learning. The method decouples on-policy environment interaction from policy optimization, enabling dense credit assignment without relying on expert trajectories, and demonstrates effectiveness on live web benchmarks.
AutoRPA is a framework that automatically distills the decision logic of ReAct-style LLM agents into robust, token-efficient RPA functions for repetitive GUI tasks, reducing token usage by 82-96%.
OpenGUI is a tool that allows AI agents to directly operate real Android apps by reading the screen and interacting naturally, rather than relying on APIs or scripts.
The author discusses building a small VLM for desktop GUI automation to move data between apps without APIs, expressing interest in non-coding autonomous use cases for local models.
ToolCUA is a new agent framework that optimizes GUI-tool path selection for computer use agents through staged training and reinforcement learning. It achieves state-of-the-art performance on OSWorld-MCP by effectively interleaving GUI actions and high-level tool calls.
ByteDance open-sourced UI-TARS-desktop, a native desktop GUI agent with 31.4k GitHub stars that uses vision models to control local or remote applications via natural language. The tool runs locally for privacy, supports Windows and macOS, and includes a CLI with streaming output for developers.
Agent S2 is a new compositional framework for computer use agents that achieves state-of-the-art performance on multiple benchmarks by utilizing Mixture-of-Grounding and Proactive Hierarchical Planning.
OpenAI introduced the Computer-Using Agent (CUA), a model combining GPT-4o's vision with reinforcement learning to interact with GUIs like a human, powering the new Operator agent. CUA sets new state-of-the-art results on several benchmarks, including 38.1% on OSWorld and 58.1% on WebArena, and is available as a research preview for ChatGPT Pro users in the US.