UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Papers with Code Trending Papers

Summary

UI-TARS-2 is a native GUI-centered agent model that addresses data scalability, multi-turn RL, and environment stability challenges, achieving state-of-the-art results on GUI benchmarks (88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld) and outperforming Claude and OpenAI agents.

The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.
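As a rough illustration of how a "mean normalized score" over a game suite is typically computed (the report does not publish its exact normalization; the convention below, where random play maps to 0 and human play to 100, is an assumption, and all game names and raw scores are invented):

```python
# Hypothetical sketch: mean human-normalized score across a game suite.
# Assumption: scores are scaled so random play = 0 and human play = 100.
# All game names and raw scores below are invented for illustration.

def normalized_score(agent: float, random_play: float, human: float) -> float:
    """Scale a raw score so that random play maps to 0 and human play to 100."""
    return 100.0 * (agent - random_play) / (human - random_play)

games = {
    # game: (agent_score, random_play_score, human_score) -- all hypothetical
    "game_a": (80.0, 10.0, 100.0),
    "game_b": (45.0, 5.0, 90.0),
    "game_c": (30.0, 0.0, 60.0),
}

per_game = [normalized_score(*scores) for scores in games.values()]
mean_normalized = sum(per_game) / len(per_game)
print(round(mean_normalized, 1))  # prints 58.3
```

Under this convention, a suite-level score of 59.8 reads directly as roughly 60% of human-level performance, which matches how the report phrases the result.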

Cached at: 05/09/26, 12:29 AM


Source: https://huggingface.co/papers/2509.02544 Published on Sep 2, 2025

#2 Paper of the day


Links: arXiv page · PDF · Project page · GitHub (10.3k stars)

Get this paper in your agent:

hf papers read 2509.02544

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (2)

meituan/EvoCUA-32B-20260105 (33B) • Updated Mar 31 • 884 • 25

meituan/EvoCUA-8B-20260105 (9B) • Updated Mar 31 • 2.92k • 14

Datasets citing this paper (0)

No dataset links this paper.

Cite arxiv.org/abs/2509.02544 in a dataset README.md to link it from this page.

Spaces citing this paper (0)

No Space links this paper.

Cite arxiv.org/abs/2509.02544 in a Space README.md to link it from this page.

Collections including this paper (22)

Browse 22 collections that include this paper

Similar Articles

bytedance/UI-TARS-desktop

GitHub Trending (daily)

ByteDance released TARS, a multimodal AI agent stack comprising Agent TARS (a CLI/Web UI-based general AI agent for GUI, browser, and terminal tasks) and UI-TARS Desktop (a native desktop application powered by the UI-TARS model for local and remote computer/browser automation). The stack integrates multimodal LLMs with MCP tools for human-like task completion.

Computer-Using Agent

OpenAI Blog

OpenAI introduced the Computer-Using Agent (CUA), a model combining GPT-4o's vision with reinforcement learning to interact with GUIs like a human, powering the new Operator agent. CUA sets new state-of-the-art results, including 38.1% on OSWorld and 58.1% on WebArena, and is available as a research preview for ChatGPT Pro users in the US.

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Hugging Face Daily Papers

UniDoc-RL presents a reinforcement learning framework for Large Vision-Language Models that optimizes retrieval, reranking, and visual reasoning through hierarchical decision-making and dense multi-reward supervision, achieving up to 17.7% improvements over prior RL-based methods on visual RAG tasks.