Tag
MEMPROBE is a benchmark that evaluates long-term memory in LLM agents by reconstructing hidden user states from the agent's memory after interaction.
ActiveGraph announces two new papers on agent memory (LongMemEval) and self-improvement regimes, along with reference agents, pack templates, and upcoming meetups in Seattle and San Francisco.
AtomMem introduces a long-term memory system for LLM agents that uses atomic facts as efficient memory units, organizing them into hierarchical event structures and temporal user profiles, achieving state-of-the-art on the LoCoMo benchmark.
An Elasticsearch blog post describes building a persistent agent memory layer with three memory types (episodic, semantic, procedural), achieving 0.89 recall on a QA eval with zero tenant leaks by combining hybrid recall with document-level security (DLS) isolation.
CoreMem proposes a resource-efficient edge-cloud memory architecture for dialogue agents, using Riemannian retrieval with a Fisher-Rao metric and Fisher-guided discrete token distillation to achieve strong accuracy improvements within an 8 GB VRAM budget.
MemTrace is a benchmark that evaluates LLM agent memory at the knowledge point level, probing how facts behave under varying memory age, question type, and evidence conditions. It reveals that pooled accuracy hides distinct failure modes, and that the main bottleneck is evidence use rather than retrieval.
T-Mem is a new long-term conversational memory architecture that enables both descriptive and associative recall, covering scenarios where query and memory share surface features and those where they are connected by latent semantic arcs. It reaches state-of-the-art on the LoCoMo and LoCoMo-Plus benchmarks.
Tencent open-sourced Hy-Memory, a memory plugin for AI agents that provides long-term memory with a 6-layer dual-reasoning framework, reducing token usage by 35% and memory bloat by 70%.
Midas achieves 0.56 recall@k on BEAM 100K and 0.51 on BEAM 500K with zero LLM calls and zero cost, demonstrating efficient long-term memory for agents.
MemRefine is an LLM-guided framework for compressing long-term agent memory under fixed storage budgets, using similarity for candidate pairing and an LLM judge for factual deletion/merge decisions, outperforming rule-based baselines on benchmarks.
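The pairing-then-judging loop summarized above can be sketched roughly as follows. This is not MemRefine's implementation: the Jaccard similarity, the `compress` function, the action names, and the `judge` callback (a stand-in for an LLM call) are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, used here only to pick candidate pairs."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def compress(memories, budget, judge):
    """Shrink `memories` to at most `budget` items.

    `judge(a, b)` is a hypothetical callback standing in for an LLM judge.
    It returns (action, merged_text), where action is one of
    "merge", "delete_first", "delete_second", or "keep".
    """
    items = list(memories)
    rejected = set()  # pairs the judge already refused to touch
    while len(items) > budget:
        pairs = [
            (jaccard(a, b), a, b)
            for a, b in combinations(items, 2)
            if (a, b) not in rejected
        ]
        if not pairs:
            break  # nothing left that the judge is willing to compress
        _, a, b = max(pairs, key=lambda p: p[0])  # most similar pair first
        action, merged = judge(a, b)
        if action == "merge":
            items.remove(a)
            items.remove(b)
            items.append(merged)
        elif action == "delete_first":
            items.remove(a)
        elif action == "delete_second":
            items.remove(b)
        else:
            rejected.add((a, b))
    return items


def toy_judge(a, b):
    """Toy rule in place of an LLM: drop a memory fully contained in another."""
    if set(a.split()) <= set(b.split()):
        return ("delete_first", None)
    if set(b.split()) <= set(a.split()):
        return ("delete_second", None)
    return ("keep", None)
```

Similarity only nominates candidates; the decision to delete or merge stays with the judge, which is the division of labor the summary describes.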
Infini Memory is a maintainable, text-based persistent memory architecture for LLM agents that uses topic-structured documents and iterative retrieval to improve long-term memory usage, achieving 64.7% on MemoryAgentBench.
REAL is a reasoning-enhanced graph framework for long-term memory management of LLMs that uses temporal and confidence-aware directed property graphs with non-destructive temporal updates and hybrid beam search retrieval, achieving an average improvement of 22.72%.
A user questions the feasibility of an AI memory manager system that decides what to keep or forget based on importance, reinforcement, and decay.
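One way to picture the keep-or-forget decision raised in that question is a single retention score per memory. The sketch below is purely illustrative: the formula, the `half_life_days` parameter, the logarithmic reinforcement bonus, and the pruning threshold are assumptions, not details from the post.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    importance: float   # assumed scale: 0.0 (trivial) to 1.0 (critical)
    access_count: int   # times the item was recalled (reinforcement)
    age_days: float     # days since the item was last accessed


def retention_score(item: MemoryItem, half_life_days: float = 30.0) -> float:
    """Hypothetical score: importance, boosted by reinforcement, halved every half-life."""
    decay = 0.5 ** (item.age_days / half_life_days)
    reinforcement = 1.0 + math.log1p(item.access_count)
    return item.importance * reinforcement * decay


def prune(items, threshold: float = 0.2):
    """Keep only the memories whose score is at or above the threshold."""
    return [m for m in items if retention_score(m) >= threshold]
```

Under these assumptions, an important memory that is never recalled still fades, while frequent recall slows its decline. Whether such a hand-tuned rule is enough in practice is exactly what the question leaves open.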
This paper proposes a training-free, CPU-only retrieval method that fuses BM25 lexical scores with late-interaction dense scores for conversational memory retrieval, achieving up to +17.2 points improvement on LoCoMo Hit@1 over late interaction alone across six encoders. The study provides controlled ablations on pooling operators, reranker effects, and benchmark robustness, framing the gain as a division of labor between dense and lexical signals.
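To make the idea of fusing lexical and late-interaction scores concrete, here is a minimal CPU-only sketch. The summary does not state the paper's fusion rule, so the min-max normalization, the `alpha` weight, and all function names below are assumptions. Token vectors are passed in as plain lists rather than produced by a real encoder.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Plain BM25 over whitespace-tokenized documents."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: best dot product per query token, summed."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def min_max(xs):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def fuse(lexical, dense, alpha=0.5):
    """Assumed fusion rule: weighted sum of normalized scores."""
    return [
        alpha * l + (1 - alpha) * d
        for l, d in zip(min_max(lexical), min_max(dense))
    ]
```

In this sketch a rare exact term such as a name or place can lift a memory that the dense score ranks lower, which is one reading of the "division of labor" framing.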
LifeSide is a new benchmark for evaluating AI agents as lifelong digital companions, testing memory tracking, user understanding, privacy control, and emotional companionship across 2,000 personas and 111K tasks in multi-session settings. Results show that even top models fail to sustain accurate user understanding and genuine companionship over long horizons.
SubtleMemory is a benchmark for evaluating AI agents' fine-grained relational memory discrimination in long-horizon interactions, consisting of 1,522 instances over 10 long histories. It reveals limitations in current memory systems for preserving and utilizing nuanced memory relationships.
Garry Tan's gbrain-evals is an open-source test suite for gbrain, a long-term memory system for AI agents. It includes four end-to-end evaluations that verify SkillOpt functionality and reports high recall and precision on multiple benchmarks.
Tencent has open-sourced TencentDB Agent Memory, which addresses long-context overflow in AI agents through hierarchical memory management (symbolic short-term memory plus hierarchical long-term memory). Benchmarks show token consumption reduced by up to 61% and task success rate improved by over 50%.
ByteDance Seed has open-sourced the TaskMem checkpoint, trained on Qwen3-VL-30B-A3B. It uses two-stage reinforcement learning to teach multimodal agents to generate long-term memory from video streams, achieving significant improvements on benchmarks such as VideoMME and EgoLife.
The author argues that AI agent memory should focus on pruning data rather than hoarding, drawing parallels to human memory types (sensory, short-term, long-term) and suggesting that modeling after human memory can reduce token usage while maintaining high-quality context.