MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management
Summary
MemGUI-Agent introduces proactive context management for long-horizon mobile GUI tasks, using Context-as-Action (ConAct) to keep critical information in a compact context. The authors also construct the MemGUI-3K dataset; an 8B model trained on it achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark.
Cached at: 06/24/26, 05:47 AM
Source: https://huggingface.co/papers/2606.19926
Abstract
MemGUI-Agent addresses long-horizon mobile GUI task limitations through proactive context management using Context-as-Action (ConAct) to maintain critical information across extended sequences.
MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.
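The contrast between ReAct-style appending and ConAct can be illustrated with a minimal sketch. Only the three field names (folded action history, folded UI state, recent step record) come from the abstract. The `Context` and `PolicyOutput` classes, the `apply_context_actions` helper, the action strings, and the toy trajectory are all hypothetical and are not taken from the paper's code or from MemGUI-3K.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """The three structured context fields named in the abstract."""
    folded_action_history: str = ""
    folded_ui_state: str = ""
    recent_step_record: str = ""

    def render(self) -> str:
        # Prompt text is built from the three compact fields only.
        return (
            f"[history] {self.folded_action_history}\n"
            f"[ui facts] {self.folded_ui_state}\n"
            f"[last step] {self.recent_step_record}"
        )


@dataclass(frozen=True)
class PolicyOutput:
    """Hypothetical schema: one UI action plus context-management actions."""
    ui_action: str
    fold_history: str
    fold_ui_state: str
    step_record: str


def apply_context_actions(out: PolicyOutput) -> Context:
    # The policy decides what the next context contains;
    # the previous context is replaced, not appended to.
    return Context(out.fold_history, out.fold_ui_state, out.step_record)


# Toy, hand-written policy outputs (illustrative assumption only).
trajectory = [
    PolicyOutput("open_app(Mail)", "opened Mail", "", "Mail inbox is visible"),
    PolicyOutput("tap(first email)", "opened Mail; opened first email",
                 "meeting time: 3 pm", "email body shows the meeting time"),
    PolicyOutput("open_app(Calendar)", "read meeting time in Mail; opened Calendar",
                 "meeting time: 3 pm", "Calendar month view is visible"),
]

react_history = []  # ReAct-style baseline: one appended record per step
ctx = Context()
for out in trajectory:
    react_history.append(f"{out.ui_action} -> {out.step_record}")
    ctx = apply_context_actions(out)

react_prompt = "\n".join(react_history)
conact_prompt = ctx.render()
```

In this toy run the ReAct-style list gains one entry per step, while the ConAct-style context always consists of the same three fields. The cross-app fact ("meeting time: 3 pm") survives the switch from Mail to Calendar because the policy chose to carry it in the folded UI state.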
Models citing this paper: 1
#### lgy0404/MemGUI-8B-SFT Image-Text-to-Text • 9B • Updated 5 days ago • 50
Datasets citing this paper: 1
#### lgy0404/MemGUI-3K Viewer • Updated 5 days ago • 2.96k • 702
Similar Articles
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
MementoGUI introduces a plug-in agentic memory framework for GUI agents that uses learned controllers for selective memory management and retrieval, improving performance on long-horizon tasks with compressed visual and textual representations.
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
MobileGym is a browser-based simulation platform for mobile GUI agent research, featuring deterministic state evaluation and scalable parallel execution. It includes a benchmark of 416 tasks and demonstrates gains using GRPO on Qwen3-VL-4B.
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments
OmniGUI introduces a step-level benchmark for GUI agents that integrates static images, synchronous audio, and video clips to simulate real smartphone interactions. Evaluation shows current models struggle with temporal and auditory inputs, highlighting the need for omni-modal capabilities.
MemGym: a Long-Horizon Memory Environment for LLM Agents
MemGym is a benchmark for evaluating memory formation in LLM agents over long-horizon tasks, unifying existing agent gyms and synthetic pipelines with memory-isolated scores. It spans tool-use dialogue, multi-turn search, coding, and computer use, and includes a lightweight reward model (MemRM) for efficient evaluation.
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
MIRAGE is a framework for mobile GUI agents that replaces verbose chain-of-thought reasoning with compact continuous latent representations, incorporating a generative world model perspective to predict future screen states before acting. On AndroidWorld and AndroidControl benchmarks, it achieves competitive or superior performance while reducing generated tokens by over 75%.