RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

Hugging Face Daily Papers

Summary

RoboMemArena introduces a large-scale benchmark for evaluating robotic memory across 26 complex tasks with real-world validation, alongside PrediMem, a dual-system vision-language-action model that improves memory management through predictive coding.

Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average trajectory lengths exceeding 1,000 steps per task and 68.9% of subtasks being memory-dependent. The generation pipeline leverages a vision-language model (VLM) to design and compose subtasks, generates full trajectories through atomic functions, and provides memory-related annotations, including subtask instructions and native keyframe annotations, while paired real-world memory tasks support physical evaluation. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and provides insights into memory management, model architecture, and scaling laws for complex memory systems.
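
The abstract names the key pieces of PrediMem's memory management but gives no implementation detail. Purely as a hedged illustration, the Python sketch below shows one way a memory bank with recent and keyframe buffers could use a prediction-error signal (the predictive coding idea) to decide which steps to keep long-term. The class name, embedding-distance metric, buffer sizes, and threshold are all illustrative assumptions, not details from the paper.

from collections import deque
import numpy as np

class MemoryBank:
    """Hypothetical two-buffer robotic memory: a dense recent window plus
    sparse keyframes selected by prediction error (predictive coding)."""

    def __init__(self, recent_size=8, keyframe_size=32, error_threshold=2.0):
        self.recent = deque(maxlen=recent_size)       # latest observations, dense
        self.keyframes = deque(maxlen=keyframe_size)  # sparse long-horizon memory
        self.error_threshold = error_threshold        # arbitrary surprise cutoff

    def update(self, step, obs_emb, pred_emb):
        """Add one step; promote it to a keyframe if prediction error is high."""
        # Predictive-coding signal: distance between the observed embedding
        # and the model's prediction of it; a large gap marks a surprising,
        # task-relevant change worth remembering.
        error = float(np.linalg.norm(obs_emb - pred_emb))
        self.recent.append((step, obs_emb))
        if error > self.error_threshold:
            self.keyframes.append((step, obs_emb))
        return error

    def planner_context(self):
        """Memory handed to the high-level planner: keyframes + recent window."""
        merged = {step: emb for step, emb in [*self.keyframes, *self.recent]}
        return sorted(merged.items())  # temporally ordered (step, embedding) pairs

# Toy usage: mostly-predictable steps with an occasional surprise.
bank = MemoryBank()
rng = np.random.default_rng(0)
for t in range(1000):
    obs = rng.normal(size=64)
    scale = 1.0 if t % 100 == 0 else 0.1   # every 100th step is "surprising"
    pred = obs + rng.normal(scale=scale, size=64)
    bank.update(t, obs, pred)
print(len(bank.planner_context()))  # keyframes plus the recent window

In the actual paper the high-level planner is a VLM and the predictive coding head is learned; this sketch only conveys the data flow between the buffers and the planner.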

Source: https://huggingface.co/papers/2605.10921



Get this paper in your agent:

hf papers read 2605.10921

Don’t have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

MEME: Multi-entity & Evolving Memory Evaluation

Hugging Face Daily Papers

The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Hugging Face Daily Papers

RoboLab is a high-fidelity simulation benchmarking framework for evaluating task-generalist robotic policies, introducing the RoboLab-120 benchmark with 120 tasks across visual, procedural, and relational competency axes. It enables scalable, realistic task generation and systematic analysis of policy behavior under controlled perturbations to assess true generalization capabilities.