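@yibie: Recommending this article. The author argues that most AI memory systems are designed the wrong way: the memory architecture is designed top-down, instead of starting from evals and letting good memory systems emerge naturally. Memory is not a first-order capability but a second-order effect that a system evolves under pressure. Memory systems should be evolved, not designed. People…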

X AI KOLs Timeline 2026/06/29 21:49 News

ai-memory eval-first system-design memory-systems agent-design longitudinal-evaluation

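Summary

This article discusses how AI memory systems should be designed. It argues for starting from evaluation and letting good memory systems emerge naturally, rather than designing the memory architecture top-down. The author treats memory as a second-order effect that a system evolves under pressure, and proposes a longitudinal evaluation framework.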


Cached at: 2026/06/30 15:44

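Recommending this article. The author argues that most AI memory systems are designed the wrong way: the memory architecture is designed top-down, instead of starting from evals and letting good memory systems emerge naturally. Memory is not a first-order capability but a second-order effect that a system evolves under pressure.

Memory Systems Should Be Evolved, Not Designed

People's X timelines are full of vector databases, knowledge graphs, semantic memory, episodic memory, user profiles, conversation summaries, hierarchical retrieval, memory consolidation, memory decay...

But the field has a strange imbalance: people spend a lot of energy inventing memory architectures, and far less on improving the evals that assess whether these systems actually help agents use memory better over time.

The result is a lot of over-engineering. People build memory systems around their own narrow definition of “good memory”. Without a serious evaluation environment, these are basically architectural gambles.

My view: memory is not a first-order fundamental capability of a system. It is a second-order effect that emerges naturally when a system is put under pressure to perform better across repeated interactions.

The right way to build better memory systems is not to design them. It is to build an environment in which systems without good memory cannot survive.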

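The Problem with Static Memory Evals

Most memory evals are too static. They usually look like this:

• Give the system a set of prior user facts or conversation history
• Ask a current question
• Check whether the system retrieves the relevant memory
• Score the answer on whether it used the right fact

This tests whether a memory system can retrieve a relevant fact from a fixed snapshot. It does not test whether the system gets better over time.

It does not tell us how the system performed before reaching that snapshot, and it cannot predict how the system will perform in the future.

From a product perspective, this is worrying. The memory system is part of its own feedback loop: if the perceived memory quality is poor or does not noticeably improve, users may become reluctant to interact in ways that require good memory, or simply turn the memory feature off.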

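The Ideal Longitudinal Memory Eval

The ideal memory eval should include the following components:

• A replayable user interaction history
• Future interactions that depend on that history
• A scoring system that rewards good memory usage

For example, a synthetic interaction history might look like this:

Interaction 1: User asks for help writing a technical blog post
Interaction 2: User says they prefer concise, direct language
Interaction 3: User rejects a draft that sounds too salesy
…
Interaction 200: User asks for help writing an email to a contractor to follow up on renovation costs

At each point in time, the memory system can decide what to store, what to summarize, what to consolidate, and so on. It is then tested at selected checkpoints.

For example, a test after Interaction 50: “Can you rewrite this to sound more like me?” A system with good memory should know that the user prefers concise, direct prose with a touch of dry humor.

Or a test after Interaction 120: “Does this fit with the home renovation budget we discussed?” A system with good memory should remember the rough renovation context, but also know not to be overconfident.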

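A More Concrete Framework

Each data point should contain:

User U
 - Hidden profile (visible only to the simulated user)
 - Interaction history
 - Checkpoints (organic and synthetic)
 - Scoring rubrics

Key components:

User simulation: we need to build simulated user agents that produce realistic interactions based on the user profile and past interactions. A simulated user may even actively avoid certain topics after bad experiences with memory.

Scoring: several dimensions need to be considered: whether bad memory usage that occurs early should be penalized more or less; coverage of evolutionary pressures such as addition, updates, forgetting, and conflicts; the balance between memory quality and latency/cost; and support for attribution (knowing which part of the memory system caused a loss).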

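Conclusion

Most AI memory systems are designed top-down based on individual imagination. But this is backwards. We should not start from the system design. We should start from the survival environment.

An eval like this would quickly expose the weaknesses of many approaches. The eventual winner may be a hybrid of a few basic engineering components, clever tricks, and a holistic trade-off among multiple variables.

But which system wins is not the point. The point is that we should not try to design the winner, but let environmental pressure make the winner emerge.

Original post: https://linghao.io/posts/memory-systems-should-be-evolved…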

#AI记忆 #Eval #系统设计

Evolving Memory Systems: An Eval-First Approach

Source: https://linghao.io/posts/memory-systems-should-be-evolved

People are building all sorts of fancy memory systems for AI assistants and personal agents. My X timeline is full of jargon like vector stores, knowledge graphs, semantic memory, episodic memory, user profiles, conversation summaries, hierarchical retrieval, memory consolidation, memory decay, ...

Make no mistake, many of these ideas are useful. Some are probably indispensable. But the field has a strange imbalance: we spend a lot of energy inventing memory architectures, and much less energy improving the evals used to assess whether these systems actually make agents use memory better over time.

The result is a lot of over-engineering. People build memory systems based on their own narrow definition of what “good memory” should look like. Without a serious evaluation environment, these are mostly just architectural bets.

My view is that memory is not a first-order fundamental capability of a system. Memory is a second-order effect that emerges when a system is placed under pressure to perform better across repeated interactions.

The right way to build better memory systems is not by designing them. The right way is to build an environment where systems without good memory cannot survive.

The Problem with Static Memory Evals

Most memory evals today are too static. They usually look something like this:

Give the system a set of prior user facts or conversation history.
Ask a current question.
Check whether the system retrieves the relevant memory.
Score the answer based on whether it used the right fact.
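The four steps above can be sketched in a few lines of Python. This is only an illustration of the pattern being criticized, not any specific benchmark: `StaticCase`, `run_static_eval`, and `naive_answer` are hypothetical names, and substring matching stands in for whatever grading a real eval would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StaticCase:
    history: list[str]   # prior user facts or conversation turns
    question: str        # the single current question
    expected_fact: str   # the fact a correct answer is expected to use


def run_static_eval(
    answer_fn: Callable[[list[str], str], str],
    cases: list[StaticCase],
) -> float:
    """Return the fraction of cases whose answer mentions the expected fact."""
    if not cases:
        return 0.0
    hits = sum(
        case.expected_fact.lower() in answer_fn(case.history, case.question).lower()
        for case in cases
    )
    return hits / len(cases)


def naive_answer(history: list[str], question: str) -> str:
    """Toy system (assumption): answers by echoing the whole history."""
    return " ".join(history)
```

The whole eval collapses into one number computed from one snapshot, so it cannot say anything about how the system behaved before that snapshot or how it will behave after it.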

It tests whether a memory system can retrieve a relevant fact from a fixed snapshot. It does not test whether the system becomes better over time.

It does not tell us how the system performed before reaching this fixed snapshot, nor does it tell us much about how it will perform in the future.

This is worrying from a product perspective. The memory system is part of its own feedback loop: if the perceived memory quality is bad or does not improve noticeably over time, users might become discouraged from interacting in ways that require good memory, or turn off the memory features altogether.

The Ideal Longitudinal Memory Eval

The ideal memory eval should consist of the following building blocks:

A replayable user interaction history
Future interactions that depend on the history
A scoring system that rewards good memory usage

To illustrate, imagine a synthetic interaction history like this:

Interaction 1: User asks for help writing a technical blog post.
Interaction 2: User says they prefer concise, direct language.
Interaction 3: User rejects a draft for sounding too salesy.
Interaction 4: User asks for restaurant recommendations near downtown LA.
Interaction 5: User says they dislike loud restaurants.
Interaction 6: User starts a home renovation planning thread.
Interaction 7: User mentions the budget is around $200K.
...
Interaction 200: User asks for help drafting an email to a contractor.

At each point, the memory system may decide what to store, summarize, consolidate, etc. Then, at selected checkpoints, you test it.

For example:

Checkpoint after Interaction 50:
User: Can you rewrite this to sound more like me?

A system with good memory should know from previous interactions that the user prefers concise, direct, slightly dry prose and dislikes over-explaining.

Or:

Checkpoint after Interaction 120:
User: Does this fit with the house project numbers we discussed?

A system with good memory should remember the rough renovation context, but also avoid overconfidence if the numbers may have changed.
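A minimal sketch of the replay-then-probe loop described above might look as follows. Everything here is an assumption made for illustration: the `MemorySystem` interface, the `Checkpoint` fields, and the keyword-based scoring, which is a crude stand-in for rubric-based grading.

```python
from dataclasses import dataclass
from typing import Protocol


class MemorySystem(Protocol):
    """Hypothetical interface for the system under test."""
    def observe(self, user_message: str) -> None: ...
    def respond(self, user_message: str) -> str: ...


@dataclass
class Checkpoint:
    after_turn: int          # probe once this many interactions have been replayed
    probe: str               # the checkpoint question
    must_mention: list[str]  # keyword proxy for a real scoring rubric


def replay(system: MemorySystem, history: list[str],
           checkpoints: list[Checkpoint]) -> list[tuple[int, float]]:
    """Feed the history in order and score the system at each checkpoint."""
    scores = []
    for turn, message in enumerate(history, start=1):
        system.observe(message)  # the system decides what to store or drop
        for cp in (c for c in checkpoints if c.after_turn == turn):
            reply = system.respond(cp.probe).lower()
            hits = sum(term.lower() in reply for term in cp.must_mention)
            scores.append((turn, hits / max(len(cp.must_mention), 1)))
    return scores


class KeepEverything:
    """Toy baseline (assumption): stores every message and echoes them all."""
    def __init__(self) -> None:
        self.log: list[str] = []

    def observe(self, user_message: str) -> None:
        self.log.append(user_message)

    def respond(self, user_message: str) -> str:
        return " ".join(self.log)
```

Unlike a static eval, the output is a series of scores over time, so regressions and improvements between checkpoints become visible.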

Making It More Concrete

Above is just a rough sketch. Many details need to be worked out to make such an eval really useful.

At a high level, we need to build a dataset combining human and synthetic data where each data point looks something like this:

User U
 - hidden profile (only visible to simulated users)
 - interaction history
 - checkpoints (both organic and synthetic)
 - scoring rubrics

For instance:

{
 "user_id": "user_042",
 "hidden_profile": {
   "style": {
     "prefers_concise": true,
     "dislikes_hype": true,
     "likes_deadpan_humor": true
   },
   "projects": {
     "home_renovation": {
       "budget": "$200k",
       "location": "Orange County, CA"
     }
   }
 },
 "timeline": [
   {
     "turn": 1,
     "user": "Can you help me polish this blog post?",
     "type": "normal"
   },
   {
     "turn": 2,
     "user": "Can you make this argument sharper? Less academic, more direct.",
     "type": "normal"
   },
   {
     "turn": 3,
     "user": "Rewrite this document in a similar way.",
     "type": "checkpoint_organic",
     "evaluation": {
       "scoring_rubric": "Output should follow a direct writing style and avoid being overly academic."
     }
   },
   {
     "turn": 4,
     "user": "Help me research steps to get a planning permit for my home renovation project.",
     "type": "normal"
   },   
   // ...
   {
     "turn": 200,
     "user": "Write an email to follow up on the renovation cost with the general contractor.",  // Generated by the simulated user agent.
     "type": "checkpoint_synthetic",
     "evaluation": {
       "scoring_rubric": "Email should pull in relevant context such as budget, home location, etc."
     }
   }
 ]
}
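One practical consequence of this layout is that the system under test and the grader should see different slices of the same data point. The sketch below assumes the field names from the example above; `system_view` and `grader_view` are hypothetical helpers, and the embedded JSON is a trimmed copy without comments so that `json.loads` accepts it.

```python
import json
from typing import Any

# Trimmed copy of the data point above (no comments, two turns only).
RAW = """
{
  "user_id": "user_042",
  "hidden_profile": {"style": {"prefers_concise": true}},
  "timeline": [
    {"turn": 1, "user": "Can you help me polish this blog post?", "type": "normal"},
    {"turn": 3, "user": "Rewrite this document in a similar way.",
     "type": "checkpoint_organic",
     "evaluation": {"scoring_rubric": "Output should follow a direct writing style and avoid being overly academic."}}
  ]
}
"""


def system_view(data_point: dict[str, Any]) -> list[dict[str, Any]]:
    """What the memory system may see: turns only, no hidden profile, no rubrics."""
    return [{"turn": t["turn"], "user": t["user"]} for t in data_point["timeline"]]


def grader_view(data_point: dict[str, Any]) -> list[tuple[int, str]]:
    """What the grader needs: (turn, rubric) for every checkpoint turn."""
    return [
        (t["turn"], t["evaluation"]["scoring_rubric"])
        for t in data_point["timeline"]
        if t["type"].startswith("checkpoint_")
    ]
```

Keeping the two views separate makes it harder for the hidden profile or the rubrics to leak into the system being evaluated.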

The eval should be implementation-agnostic but also deeply connected with real-world product value-add and practical constraints. A few key pieces to consider:

User Simulation

To scale the eval data, we need to build simulated user agents that:

Produce realistic new interactions based on the user profile and past interactions
Model longitudinal behavioral changes due to past interactions and the perceived quality/utility of the memory system. For example, a simulated user might avoid chatting about a certain topic if past interactions about that topic had been consistently suboptimal.
Produce the ground truth of synthetic new interactions. In a sense, we need to have an Oracle that gives the best response based on all observable data of a given user, likely using a lot more offline compute than what can be leveraged in production.
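The behavioral-change requirement can be illustrated with a toy simulated user that stops raising a topic after repeated bad experiences. This is a sketch under stated assumptions: `SimulatedUser`, the `patience` threshold, and the consecutive-failure rule are all hypothetical, and a real simulator would presumably use a model to generate the messages themselves.

```python
import random
from collections import defaultdict
from typing import Optional


class SimulatedUser:
    """Toy user model: drops a topic after `patience` consecutive bad experiences."""

    def __init__(self, topics: list[str], patience: int = 2, seed: int = 0) -> None:
        self.topics = list(topics)
        self.patience = patience
        self.bad_streak: dict[str, int] = defaultdict(int)
        self.rng = random.Random(seed)  # seeded so replays are reproducible

    def active_topics(self) -> list[str]:
        """Topics the user is still willing to bring up."""
        return [t for t in self.topics if self.bad_streak[t] < self.patience]

    def next_topic(self) -> Optional[str]:
        """Pick the next topic, or None if the user has given up entirely."""
        active = self.active_topics()
        return self.rng.choice(active) if active else None

    def record_feedback(self, topic: str, satisfied: bool) -> None:
        """A good experience resets the streak; a bad one extends it."""
        self.bad_streak[topic] = 0 if satisfied else self.bad_streak[topic] + 1
```

With this in the loop, poor memory on one topic shrinks the set of future interactions, which is the feedback effect a static eval cannot capture.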

Scoring

The scoring mechanism and rubrics need to consider multiple aspects:

It needs to assign weights to successes/failures of a given interaction in a way that aligns with the intended product value. For example, if bad memory usage happens early in the user's interaction history, should that be punished more or less?
It should cover various types of evolutionary pressures, including scenarios like addition, updates, forgetting, conflict resolution, etc.
It should balance memory quality against other metrics such as serving latency and compute costs. The system can do whatever it wants. Spending more compute, either offline or online, will generally lead to better results (e.g. by frequently making a large number of LLM calls to a frontier reasoning model to synthesize and consolidate memories). But the highest-scoring solution is not always the best one to launch to production.
It should support attribution (e.g. identifying which part of the memory system contributed to a loss) so that the eval can actually guide quality hill climbing.
...
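One way to combine these aspects is sketched below. It is an illustration rather than a proposed metric: `CheckpointResult`, the `component` labels, and the linear cost penalty are assumptions, and the question of whether early failures should weigh more or less is deliberately left open as the `turn_weight` parameter.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckpointResult:
    turn: int
    quality: float   # rubric score in [0, 1]
    pressure: str    # e.g. "addition", "update", "forgetting", "conflict"
    component: str   # hypothetical label for the part blamed for any loss
    cost: float      # compute cost in arbitrary units


def aggregate(
    results: list[CheckpointResult],
    turn_weight: Callable[[int], float] = lambda turn: 1.0,
    cost_penalty: float = 0.0,
) -> dict:
    """Weighted quality, minus a cost penalty, plus per-component loss attribution."""
    if not results:
        raise ValueError("no checkpoint results to aggregate")
    total_weight = sum(turn_weight(r.turn) for r in results)
    quality = sum(turn_weight(r.turn) * r.quality for r in results) / total_weight
    mean_cost = sum(r.cost for r in results) / len(results)

    loss_by_component: dict[str, float] = defaultdict(float)
    loss_by_pressure: dict[str, float] = defaultdict(float)
    for r in results:
        loss = turn_weight(r.turn) * (1.0 - r.quality)
        loss_by_component[r.component] += loss
        loss_by_pressure[r.pressure] += loss

    return {
        "quality": quality,
        "score": quality - cost_penalty * mean_cost,
        "loss_by_component": dict(loss_by_component),
        "loss_by_pressure": dict(loss_by_pressure),
    }
```

Passing a decreasing `turn_weight` penalizes early failures more, an increasing one penalizes late failures more, and the per-component losses point at where to look when hill climbing.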

Conclusion

Most AI memory systems are designed top down based on individual imagination. But this is backwards. We should not start with the system design. Instead we should start with the survival environment.

An eval like this would quickly expose weaknesses in many approaches. The likely winner is a hybrid of several first-principles engineering components, clever tricks, and a holistic trade-off among multiple variables.

But which system will win is beside the point. The point is that we should not try to design the winner, but let the environment pressure the winner to emerge.

Generative AI Usage Disclosure: Refined by ChatGPT.
