MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Hugging Face Daily Papers

Summary

MementoGUI introduces a plug-in agentic memory framework for GUI agents that uses learned controllers for selective memory management and retrieval, improving performance on long-horizon tasks with compressed visual and textual representations.

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
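The abstract names four memory operators (step processing, memory compression, episodic writing, episodic selection) but gives no implementation details. The following is a minimal sketch of how such operators might fit together, not the authors' method. Every class, method, and parameter name here is hypothetical, and the learned MLLM controller is swapped for a plain word-overlap heuristic purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Hypothetical types: none of these names come from the paper.
ROI = Tuple[int, int, int, int]  # (left, top, right, bottom) screenshot crop


@dataclass
class MemoryEntry:
    step: int
    summary: str               # textual summary of an interface event
    roi: Optional[ROI] = None  # localized visual evidence, if any


@dataclass
class Episode:
    task: str
    entries: List[MemoryEntry]


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap. ASSUMPTION: stands in for learned relevance."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class MemoryController:
    capacity: int = 3
    working: List[MemoryEntry] = field(default_factory=list)
    episodic: List[Episode] = field(default_factory=list)

    def process_step(self, step: int, task: str, summary: str,
                     roi: Optional[ROI] = None,
                     threshold: float = 0.1) -> bool:
        """Step processing: keep the event only if it looks task-relevant."""
        if overlap(task, summary) < threshold:
            return False
        self.working.append(MemoryEntry(step, summary, roi))
        if len(self.working) > self.capacity:
            self.compress()
        return True

    def compress(self) -> None:
        """Memory compression: merge the two oldest entries into one."""
        first, second = self.working[0], self.working[1]
        merged = MemoryEntry(
            step=second.step,
            summary=f"{first.summary}; {second.summary}",
            roi=second.roi if second.roi is not None else first.roi,
        )
        self.working[:2] = [merged]

    def write_episode(self, task: str) -> None:
        """Episodic writing: store the finished trajectory, then reset."""
        self.episodic.append(Episode(task, list(self.working)))
        self.working.clear()

    def select_episode(self, task: str) -> Optional[Episode]:
        """Episodic selection: return the most similar stored episode."""
        if not self.episodic:
            return None
        return max(self.episodic, key=lambda e: overlap(task, e.task))
```

In this sketch the GUI agent backbone is never touched: it would only receive the contents of `working` and the episode returned by `select_episode` as extra context, which mirrors the plug-in framing in the abstract. In the actual system each heuristic above would presumably be replaced by a call to the learned controller.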

Cached at: 05/19/26, 02:32 PM

Source: https://huggingface.co/papers/2605.18652

Abstract

MementoGUI presents a memory framework for GUI agents that uses learned controllers for selective memory management and retrieval, improving long-horizon task performance through compressed visual and textual representations.

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.

Similar Articles

Learning to Learn from Multimodal Experience

arXiv cs.AI

This paper introduces AutoMMemo, a framework that enables multimodal agents to automatically design memory mechanisms (expressible as executable memo programs) for learning from multimodal interaction trajectories, outperforming no-memory and fixed-memory baselines on GUI/Web navigation and visual reasoning benchmarks.

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Hugging Face Daily Papers

MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.

rohitg00/agentmemory

GitHub Trending (daily)

agentmemory is an open-source persistent memory layer for AI coding agents (Claude Code, Cursor, Gemini CLI, Codex CLI, etc.) that uses knowledge graphs, confidence scoring, and hybrid search to give agents long-term memory across sessions via MCP, hooks, or REST API. Built on the iii engine, it requires no external databases and exposes 51 MCP tools.

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Hugging Face Daily Papers

WorldMemArena is a new benchmark with 400 multi-session multimodal tasks for evaluating multimodal agent memory, comparing long-context, RAG, and harness-based memory approaches, revealing that better memory writing does not guarantee better performance and that systems struggle with visual evidence.