MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

arXiv cs.CL Papers

Summary

MEMPROBE is a benchmark that evaluates long-term memory in LLM agents by reconstructing hidden user states from the agent's memory after interaction.

arXiv:2606.24595v1. Abstract: Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through downstream behavior, such as later answers, personalization quality, or task success, which tests that understanding only indirectly and leaves the memory artifact itself largely unaudited. We argue that long-term memory should instead be evaluated as an auditable post-interaction artifact: after ordinary assistance, what structured user state can be reconstructed from the memory the agent leaves behind? We instantiate this view in MEMPROBE, a benchmark in which a memory-equipped agent assists simulated users, each carrying a hidden, taxonomy-anchored user-state bank, across a trajectory of leak-controlled tasks, after which that bank is reconstructed from the agent's resulting memory under both full-store and top-k access. Built on synthetic ground truth for efficient, scalable measurement, MEMPROBE spans 50 simulated users with 31 hidden dimensions each (1,550 recovery targets) and tests 5 representative memory systems. Testing state-of-the-art memory agents, we find that successful assistance and recoverable memory behave as distinct capabilities. Task completion nearly saturates, even for a memoryless baseline, while category-balanced recovery stays moderate (about 0.6) and drops further under top-k retrieval. MEMPROBE is the first benchmark to study memory recovery directly, reconstructing the user state a system retains and scoring it against ground truth. We see recovery as a concrete objective for future memory agents to optimize, and MEMPROBE as a step toward an environment where agents are trained to remember their users, growing more faithful the longer they know them.

Cached at: 06/24/26, 07:47 AM

# MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery
Source: [https://arxiv.org/abs/2606.24595](https://arxiv.org/abs/2606.24595)
[View PDF](https://arxiv.org/pdf/2606.24595)

> Abstract: Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through downstream behavior, such as later answers, personalization quality, or task success, which tests that understanding only indirectly and leaves the memory artifact itself largely unaudited. We argue that long-term memory should instead be evaluated as an auditable post-interaction artifact: after ordinary assistance, what structured user state can be reconstructed from the memory the agent leaves behind? We instantiate this view in MEMPROBE, a benchmark in which a memory-equipped agent assists simulated users, each carrying a hidden, taxonomy-anchored user-state bank, across a trajectory of leak-controlled tasks, after which that bank is reconstructed from the agent's resulting memory under both full-store and top-k access. Built on synthetic ground truth for efficient, scalable measurement, MEMPROBE spans 50 simulated users with 31 hidden dimensions each (1,550 recovery targets) and tests 5 representative memory systems. Testing state-of-the-art memory agents, we find that successful assistance and recoverable memory behave as distinct capabilities. Task completion nearly saturates, even for a memoryless baseline, while category-balanced recovery stays moderate (about 0.6) and drops further under top-k retrieval. MEMPROBE is the first benchmark to study memory recovery directly, reconstructing the user state a system retains and scoring it against ground truth. We see recovery as a concrete objective for future memory agents to optimize, and MEMPROBE as a step toward an environment where agents are trained to remember their users, growing more faithful the longer they know them.
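
The abstract reports a "category-balanced recovery" score but does not say how it is computed. The sketch below shows one plausible reading, not the authors' implementation: exact-match recovery is averaged within each taxonomy category and then across categories, so a category with many dimensions does not dominate. The function name, the dictionary formats, the exact-match criterion, and the toy dimensions and categories are all assumptions made for illustration.

```python
from collections import defaultdict

# Benchmark size as stated in the abstract: 50 users x 31 hidden dimensions.
N_TARGETS = 50 * 31  # 1,550 recovery targets


def category_balanced_recovery(truth, recovered, category_of):
    """Macro-average exact-match recovery over categories (hypothetical scorer).

    truth, recovered: dict mapping dimension name -> value (assumed format).
    category_of: dict mapping dimension name -> category (assumed format).
    Assumes `truth` is non-empty.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dim, true_value in truth.items():
        cat = category_of[dim]
        totals[cat] += 1
        # A dimension missing from the reconstruction counts as a miss.
        if recovered.get(dim) == true_value:
            hits[cat] += 1
    per_category = {cat: hits[cat] / totals[cat] for cat in totals}
    balanced = sum(per_category.values()) / len(per_category)
    return balanced, per_category


# Toy user with 5 hidden dimensions in 2 made-up categories.
truth = {"diet": "vegetarian", "city": "Lyon", "pet": "cat",
         "budget": "low", "risk": "averse"}
category_of = {"diet": "lifestyle", "city": "lifestyle", "pet": "lifestyle",
               "budget": "finance", "risk": "finance"}
# State reconstructed from the agent's memory: one wrong value, one missing.
recovered = {"diet": "vegetarian", "city": "Lyon", "pet": "dog", "budget": "low"}

score, per_category = category_balanced_recovery(truth, recovered, category_of)
# Unbalanced (micro) accuracy over all dimensions, for comparison.
micro = sum(recovered.get(d) == v for d, v in truth.items()) / len(truth)
```

In this toy case the balanced score is the mean of 2/3 (lifestyle) and 1/2 (finance), while the unbalanced accuracy is 3/5; the two diverge whenever categories differ in size and in how well they are recovered. Under this reading, the same scorer could be applied to reconstructions obtained with full-store access and with top-k access to compare the two settings.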

## Submission history

From: Enze Ma [[view email](https://arxiv.org/show-email/b91bd510/2606.24595)]

**[v1]** Tue, 23 Jun 2026 13:52:46 UTC (422 KB)

Similar Articles

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

arXiv cs.AI

MemTrace is a benchmark that evaluates LLM agent memory at the knowledge point level, probing how facts behave under varying memory age, question type, and evidence conditions. It reveals that pooled accuracy hides distinct failure modes, and that the main bottleneck is evidence use rather than retrieval.

MemPro: Agentic Memory Systems as Evolvable Programs

arXiv cs.CL

MemPro is a system-level evolution framework that treats the memory construction–retrieval pipeline as an evolvable program, using an Evolving Agent to iteratively diagnose failures and create improved versions. Experiments on long-horizon benchmarks show consistent improvement over static and prompt-level baselines with favorable performance–cost trade-off.