RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
Summary
RoboMemArena introduces a large-scale benchmark for evaluating robotic memory across 26 complex tasks with real-world validation, alongside PrediMem, a dual-system vision-language-action model that improves memory management through predictive coding.
Source: https://huggingface.co/papers/2605.10921
Abstract
RoboMemArena presents a large-scale robotic memory benchmark with diverse tasks and real-world evaluation, while PrediMem demonstrates improved memory management through a dual-system vision-language-action (VLA) architecture with predictive coding.
Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average trajectory lengths exceeding 1,000 steps per task and 68.9% of subtasks being memory-dependent. The generation pipeline leverages a vision-language model (VLM) to design and compose subtasks, generates full trajectories through atomic functions, and provides memory-related annotations, including subtask instructions and native keyframe annotations, while paired real-world memory tasks support physical evaluation. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and provides insights into memory management, model architecture, and scaling laws for complex memory systems.
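The page does not include code, but the memory design described above is concrete enough to sketch. Below is a minimal, illustrative Python sketch of a memory bank with a recent buffer and a keyframe buffer, where an observation is promoted to a keyframe when a predictive-coding-style prediction error (surprise) is high. All names, thresholds, and the linear-extrapolation predictor are assumptions for illustration; PrediMem's actual predictive coding head is a learned component of the VLM planner, not the stand-in shown here.

    from collections import deque
    import numpy as np

    class MemoryBank:
        """Illustrative sketch (not the authors' code): a memory bank with a
        rolling recent buffer and a sparse keyframe buffer. Observations whose
        prediction error exceeds a threshold are treated as surprising and
        promoted to keyframes."""

        def __init__(self, recent_size=16, keyframe_size=64, surprise_threshold=0.5):
            self.recent = deque(maxlen=recent_size)       # rolling window of latest embeddings
            self.keyframes = deque(maxlen=keyframe_size)  # sparse, high-surprise embeddings
            self.surprise_threshold = surprise_threshold

        def _predict_next(self):
            # Stand-in predictor: linear extrapolation from the last two
            # embeddings. PrediMem uses a learned predictive coding head.
            if len(self.recent) < 2:
                return None
            prev, last = self.recent[-2], self.recent[-1]
            return last + (last - prev)

        def add(self, embedding):
            # Predict the incoming embedding from past context, then store it.
            predicted = self._predict_next()
            self.recent.append(embedding)
            if predicted is None:
                return 0.0
            # Surprise = normalized prediction error; high error marks a keyframe.
            error = np.linalg.norm(embedding - predicted) / (np.linalg.norm(embedding) + 1e-8)
            if error > self.surprise_threshold:
                self.keyframes.append(embedding)
            return error

A higher surprise_threshold keeps the keyframe buffer sparser, while the recent buffer always holds the rolling context a planner would condition on.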
Get this paper in your agent:
hf papers read 2605.10921
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
MEME: Multi-entity & Evolving Memory Evaluation
The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.
I built a benchmark for AI “memory” in coding agents. Looking for others to beat it.
A developer created a new benchmark, continuity-benchmarks, to test AI coding agents' ability to stay consistent with project rules during active development. It addresses a gap in existing memory benchmarks, which focus on semantic recall rather than real-time architectural consistency and multi-session behavior.
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
This paper introduces LongMemEval-V2, a benchmark for evaluating long-term memory systems in web agents, along with two memory methods: AgentRunbook-R and AgentRunbook-C.
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a high-fidelity simulation benchmarking framework for evaluating task-generalist robotic policies, introducing the RoboLab-120 benchmark with 120 tasks across visual, procedural, and relational competency axes. It enables scalable, realistic task generation and systematic analysis of policy behavior under controlled perturbations to assess true generalization capabilities.
Benchmarking agent memory retrieval on LongMemEval‑S — 98% Recall@5, 100% recall by R@23, local embeddings only (all-MiniLM-L6-v2), no LLM, no API key
The author shares benchmark results for memweave, a Python library for agent memory, achieving 98% Recall@5 on LongMemEval-S using only local embeddings without LLM calls. The post details the methodology and compares performance against mempalace, highlighting stable retrieval across different question types.
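As a rough illustration of the setup this post describes (local embeddings only, no LLM calls, no API key), here is a minimal Recall@k evaluation sketch using all-MiniLM-L6-v2 through the sentence-transformers library. The function name and the single-gold-memory assumption are illustrative; this is not memweave's actual API or the LongMemEval-S loader.

    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")  # local embedding model, no API key

    def recall_at_k(queries, memories, relevant_idx, k=5):
        """Fraction of queries whose gold memory appears among the top-k
        cosine-similarity matches. relevant_idx[i] is the index into
        `memories` of the gold memory for queries[i] (single gold assumed)."""
        q = model.encode(queries, normalize_embeddings=True)
        m = model.encode(memories, normalize_embeddings=True)
        sims = q @ m.T                        # cosine similarity (vectors normalized)
        topk = np.argsort(-sims, axis=1)[:, :k]
        hits = [relevant_idx[i] in topk[i] for i in range(len(queries))]
        return float(np.mean(hits))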