MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Hugging Face Daily Papers

Summary

MemGUI-Agent introduces proactive context management for long-horizon mobile GUI tasks, using Context-as-Action (ConAct) to retain critical information in a compact context. The work also contributes the MemGUI-3K dataset; an 8B model trained on it achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark.

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.
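
To make the contrast in the abstract concrete, the following is a minimal Python sketch of passive ReAct-style accumulation next to a ConAct-style context with the three fields the abstract names. Everything beyond those three field names is an assumption made for illustration: the class names, the `PolicyOutput` schema, the string-valued fields, and the scripted steps are hypothetical, and no real model is called. The paper's actual prompt format and policy may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ConActContext:
    """Hypothetical container for the three context fields named in the abstract."""
    folded_action_history: str = ""  # compact summary of earlier actions
    folded_ui_state: str = ""        # facts from earlier screens worth keeping
    recent_step_record: str = ""     # record of the latest step only


@dataclass
class PolicyOutput:
    """One policy emission: a UI action plus context-management actions (assumed schema)."""
    ui_action: str
    new_action_history: str
    new_ui_state: str
    step_record: str


def react_append(history: list[str], output: PolicyOutput) -> list[str]:
    """ReAct-style baseline: passively append every step record."""
    return history + [output.step_record]


def apply_context_actions(output: PolicyOutput) -> ConActContext:
    """ConAct-style update: the policy's own output rewrites all three fields."""
    return ConActContext(
        folded_action_history=output.new_action_history,
        folded_ui_state=output.new_ui_state,
        recent_step_record=output.step_record,
    )


# Scripted stand-ins for what a policy might emit (assumption, not real model output).
SCRIPTED_STEPS = [
    PolicyOutput("open(Shop)", "Opened Shop.", "", "Step 1: opened Shop app."),
    PolicyOutput("tap(Orders)", "Opened Shop; viewed orders.",
                 "Order ID: A17", "Step 2: saw order A17 in list."),
    PolicyOutput("open(Mail)", "Opened Shop; viewed orders; switched to Mail.",
                 "Order ID: A17", "Step 3: opened Mail app."),
]


def run_demo() -> tuple[list[str], ConActContext]:
    history: list[str] = []
    ctx = ConActContext()
    for output in SCRIPTED_STEPS:
        history = react_append(history, output)  # grows by one record per step
        ctx = apply_context_actions(output)      # always exactly three fields
    return history, ctx


if __name__ == "__main__":
    history, ctx = run_demo()
    print(len(history), "records in the ReAct-style history")
    print(ctx)
```

In this sketch the ReAct-style history grows by one record per step, while the ConAct-style context always holds three fields, and the cross-app fact (the order ID) survives the switch to another app only because the scripted policy chose to carry it forward in the folded UI state.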

Source: https://huggingface.co/papers/2606.19926

Abstract

MemGUI-Agent addresses long-horizon mobile GUI task limitations through proactive context management using Context-as-Action (ConAct) to maintain critical information across extended sequences.

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.

Models citing this paper: 1

#### lgy0404/MemGUI-8B-SFT
Image-Text-to-Text • 9B • Updated 5 days ago • 50

Datasets citing this paper: 1

#### lgy0404/MemGUI-3K
Viewer • Updated 5 days ago • 2.96k • 702

Similar Articles

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Hugging Face Daily Papers

OmniGUI introduces a step-level benchmark for GUI agents that integrates static images, synchronous audio, and video clips to simulate real smartphone interactions. Evaluation shows current models struggle with temporal and auditory inputs, highlighting the need for omni-modal capabilities.

MemGym: a Long-Horizon Memory Environment for LLM Agents

arXiv cs.CL

MemGym is a benchmark for evaluating memory formation in LLM agents over long-horizon tasks, unifying existing agent gyms and synthetic pipelines with memory-isolated scores. It spans tool-use dialogue, multi-turn search, coding, and computer use, and includes a lightweight reward model (MemRM) for efficient evaluation.

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

arXiv cs.AI

MIRAGE is a framework for mobile GUI agents that replaces verbose chain-of-thought reasoning with compact continuous latent representations, incorporating a generative world model perspective to predict future screen states before acting. On AndroidWorld and AndroidControl benchmarks, it achieves competitive or superior performance while reducing generated tokens by over 75%.