SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents
Summary
SubtleMemory is a benchmark for evaluating AI agents' fine-grained relational memory discrimination in long-horizon interactions, consisting of 1,522 instances over 10 long histories. It reveals limitations in current memory systems for preserving and utilizing nuanced memory relationships.
Source: https://huggingface.co/papers/2606.05761
Abstract
SubtleMemory benchmark evaluates AI agents’ ability to handle complex relational memory structures that emerge during prolonged interactions, revealing limitations in current memory systems for preserving and utilizing nuanced memory relationships.
Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.
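To make the idea of a relation-controlled memory-variant set concrete, the following is a minimal Python sketch. It is not the benchmark's actual schema or scoring code: the names `Relation`, `MemoryVariant`, `VariantSet`, and `recovered_fraction`, the `session` field, and the example texts are all hypothetical, chosen only to illustrate how variants sharing one artifact might be grouped under a relation type and how one could check whether a memory system surfaced the whole set rather than a single item.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """The three relation types named in the abstract."""
    COMPLEMENTARY = "complementary"
    NUANCED = "nuanced"
    CONTRADICTORY = "contradictory"


@dataclass(frozen=True)
class MemoryVariant:
    text: str
    session: int  # hypothetical: index of the session where the variant appears


@dataclass(frozen=True)
class VariantSet:
    artifact: str                       # shared topic the variants refer to
    relation: Relation                  # how the variants relate to each other
    variants: tuple[MemoryVariant, ...]


def recovered_fraction(variant_set: VariantSet, retrieved: list[str]) -> float:
    """Fraction of a set's variants found among the retrieved memory texts.

    Illustrative only: exact string matching stands in for whatever
    matching a real evaluation would use.
    """
    wanted = {v.text for v in variant_set.variants}
    if not wanted:
        return 0.0
    return len(wanted & set(retrieved)) / len(wanted)


# Hypothetical example: two conflicting statements made in different sessions.
example = VariantSet(
    artifact="preferred meeting time",
    relation=Relation.CONTRADICTORY,
    variants=(
        MemoryVariant("User prefers morning meetings.", session=2),
        MemoryVariant("User now avoids meetings before noon.", session=7),
    ),
)

# A system that returns only the older statement recovers half of the set.
score = recovered_fraction(example, ["User prefers morning meetings."])
```

In this toy case a system that retrieves only one of the two conflicting statements scores 0.5, which illustrates why isolated recall is insufficient when the correct answer depends on the relation between memories.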
Similar Articles
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Researchers introduce Memora, a benchmark that evaluates LLMs’ ability to retain, update, and forget long-term user memories over weeks-to-months conversations, revealing frequent reuse of obsolete memories.
MEME: Multi-entity & Evolving Memory Evaluation
The MEME benchmark evaluates AI memory systems across multiple entities and evolving conditions, revealing significant challenges in dependency reasoning that persist even with advanced retrieval techniques.
Memory for agents ain't here yet
A critique of current memory solutions for AI agents, arguing that RAG wrappers and similar approaches fail to address core issues of model bias and context bloat.
MemRefine: LLM-Guided Compression for Long-Term Agent Memory
MemRefine is an LLM-guided framework for compressing long-term agent memory under fixed storage budgets, using similarity for candidate pairing and an LLM judge for factual deletion/merge decisions, outperforming rule-based baselines on benchmarks.
MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
MemReranker is a reasoning-aware reranking model family (0.6B/4B) designed for agent memory retrieval, addressing limitations in semantic similarity by incorporating LLM knowledge distillation for better temporal and causal reasoning.