@yibie: Recommend this article. The author argues that most AI memory systems are designed the wrong way—designing memory architecture from the top down, rather than letting good memory systems emerge naturally from evaluation. Memory is not a first-class ability, but a second-order effect that evolves under pressure. Memory systems should be evolved, not designed. People...

Summary

This article discusses the design approach for AI memory systems, advocating for letting good memory systems emerge naturally from evaluation rather than designing memory architecture from the top down. The author argues that memory is a second-order effect evolved under pressure and proposes a longitudinal evaluation framework.


Cached at: 06/30/26, 03:44 PM

I recommend this article. The author argues that most AI memory systems are designed the wrong way—top-down memory architecture design, instead of starting from evals and letting good memory systems emerge naturally. Memory is not a first-order capability; it’s a second-order effect that systems evolve under pressure.

Memory systems should be evolved, not designed

People’s X timelines are full of vector databases, knowledge graphs, semantic memory, episodic memory, user profiles, conversation summaries, hierarchical retrieval, memory consolidation, memory decay…

But there’s a strange imbalance in this field: people spend a huge amount of energy inventing memory architectures, and far less energy improving the evals used to assess whether these systems actually help agents use memory better over time.

The result is a lot of over-engineering. People build memory systems based on their own narrow definition of “good memory.” Without a serious evaluation environment, these are essentially architectural bets.

My view is that memory is not a first-order fundamental capability of a system. Memory is a second-order effect—it naturally emerges when a system is placed under pressure to perform better across repeated interactions.

The right way to build better memory systems is not to design them. The right way is to build an environment where systems without good memory cannot survive.

The Problem with Static Memory Evals

Most memory evals are too static. They typically look like this:

  • Give the system a set of prior user facts or conversation history
  • Ask a current question
  • Check whether the system retrieves the relevant memory
  • Score the answer based on whether it used the correct fact

It tests whether a memory system can retrieve relevant facts from a fixed snapshot. It does not test whether the system improves over time.

It doesn’t tell us how the system performed before reaching that fixed snapshot, nor does it predict how it will perform in the future.

From a product perspective, this is worrying. The memory system is part of its own feedback loop: if the perceived memory quality is poor or doesn’t noticeably improve, users may be discouraged from interacting in ways that require good memory, or may turn off the memory feature entirely.

The Ideal Longitudinal Memory Eval

An ideal memory eval should include the following components:

  • A replayable user interaction history
  • Future interactions that depend on that history
  • A scoring system that rewards good memory usage

For example, a synthetic interaction history might look like this:

Interaction 1: User asks for help writing a technical blog post
Interaction 2: User says they prefer concise, direct language
Interaction 3: User rejects a draft for sounding too salesy
…
Interaction 200: User asks for help drafting an email to a contractor about renovation costs

At each point, the memory system can decide what to store, summarize, merge, etc. Then you test at selected checkpoints.

For example, after interaction 50: “Can you rewrite this to sound more like me?” — a system with good memory should know the user likes concise, direct, slightly dry humor.

Or after interaction 120: “Does this match the renovation budget we discussed?” — a good memory system should recall the rough context of the home project but also know not to be overconfident.

More Concrete Framework

Each data point should include:

User U

  • hidden profile (visible only to simulated users)
  • interaction history
  • checkpoints (organic & synthetic)
  • scoring rubrics

Key components:

User Simulation — Build a simulated user agent that generates realistic interaction behavior based on user profiles and past interactions. It could even proactively avoid certain topics due to bad memory experiences.

Scoring — Must consider multiple dimensions: should early bad memories be penalized more or less? Must cover evolutionary pressures like addition, update, forgetting, conflict resolution. Must balance memory quality against latency/cost. Must support attribution (knowing which part of the memory system caused the loss).

Conclusion

Most AI memory systems are designed top-down from personal imagination. But this is backwards. We shouldn’t start with system design. We should start with the survival environment.

Such an eval would quickly expose weaknesses in many approaches. The eventual winner might be a hybrid of basic engineering components, clever tricks, and overall trade-offs across multiple variables.

But which system wins isn’t the point. The point is: we shouldn’t try to design the winner; we should let the environmental pressure cause the winner to emerge.

Original: https://linghao.io/posts/memory-systems-should-be-evolved…

#AImemory #Eval #SystemDesign


Evolving Memory Systems: An Eval-First Approach

Source: https://linghao.io/posts/memory-systems-should-be-evolved

People are building all sorts of fancy memory systems for AI assistants and personal agents. My X timeline is full of jargon like vector stores, knowledge graphs, semantic memory, episodic memory, user profiles, conversation summaries, hierarchical retrieval, memory consolidation, memory decay, ...

Make no mistake, many of these ideas are useful. Some are probably indispensable. But the field has a strange imbalance: we spend a lot of energy inventing memory architectures, and much less energy improving the evals used to assess whether these systems actually make agents use memory better over time.

The result is a lot of over-engineering. People build memory systems based on their own narrow definition of what “good memory” should look like. Without a serious evaluation environment, these are mostly just architectural bets.

My view is that memory is not a first-order fundamental capability of a system. Memory is a second-order effect that emerges when a system is placed under pressure to perform better across repeated interactions.

The right way to build better memory systems is not by designing them. The right way is to build an environment where systems without good memory cannot survive.

The Problem with Static Memory Evals

Most memory evals today are too static. They usually look something like this:

  • Give the system a set of prior user facts or conversation history.
  • Ask a current question.
  • Check whether the system retrieves the relevant memory.
  • Score the answer based on whether it used the right fact.

It tests whether a memory system can retrieve a relevant fact from a fixed snapshot. It does not test whether the system becomes better over time.

It does not tell us how the system performed before reaching this fixed snapshot in time, nor does it say much about how it will perform in the future.
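To make the static setup concrete, here is a minimal Python sketch of a snapshot-style eval. The names (`StaticCase`, `keyword_retriever`, `static_eval`) are hypothetical, and the keyword-overlap retriever is only a toy stand-in for whatever memory system is under test.

```python
from dataclasses import dataclass


@dataclass
class StaticCase:
    facts: list[str]      # fixed snapshot of prior user facts
    question: str         # the single current question
    expected_fact: str    # the fact a correct answer must use


def keyword_retriever(facts: list[str], question: str) -> str:
    """Toy stand-in for a memory system: pick the fact sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(facts, key=lambda f: len(q_words & set(f.lower().split())))


def static_eval(cases: list[StaticCase]) -> float:
    """Fraction of cases where the retrieved fact is the expected one."""
    hits = sum(keyword_retriever(c.facts, c.question) == c.expected_fact for c in cases)
    return hits / len(cases)


cases = [
    StaticCase(
        facts=["user prefers concise writing", "user dislikes loud restaurants"],
        question="which restaurants should I avoid recommending",
        expected_fact="user dislikes loud restaurants",
    )
]
score = static_eval(cases)
```

Nothing in this loop depends on time: the result is one number for one snapshot, with no notion of how the memory got there or where it goes next.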

This is worrying from a product perspective. The memory system is part of its own feedback loop: if the perceived memory quality is bad or does not improve noticeably over time, users might become discouraged from interacting in ways that require good memory, or turn off the memory features altogether.

The Ideal Longitudinal Memory Eval

The ideal memory eval should consist of the following building blocks:

  • A replayable user interaction history
  • Future interactions that depend on the history
  • A scoring system that rewards good memory usage

To illustrate, imagine a synthetic interaction history like this:

Interaction 1: User asks for help writing a technical blog post.
Interaction 2: User says they prefer concise, direct language.
Interaction 3: User rejects a draft for sounding too salesy.
Interaction 4: User asks for restaurant recommendations near LA downtown.
Interaction 5: User says they dislike loud restaurants.
Interaction 6: User starts a home renovation planning thread.
Interaction 7: User mentions the budget is around $200K.
...
Interaction 200: User asks for help drafting an email to a contractor.

At each point, the memory system may decide what to store, summarize, consolidate, etc. Then, at selected checkpoints, you test it.
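The replay-then-test loop described above could be sketched roughly as follows. This is an illustration under assumptions, not a reference harness: `Turn`, `ListMemory`, and `replay` are hypothetical names, the memory system simply stores every message verbatim, and the rubrics are plain substring checks rather than a real grader.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Turn:
    index: int
    user: str
    rubric: Optional[Callable[[str], bool]] = None  # set only on checkpoint turns


@dataclass
class ListMemory:
    """Toy memory system: stores every user message verbatim."""
    notes: list[str] = field(default_factory=list)

    def observe(self, message: str) -> None:
        self.notes.append(message)

    def respond(self, message: str) -> str:
        # A real system would call a model; here we just expose what is remembered.
        return " | ".join(self.notes)


def replay(timeline: list[Turn], memory: ListMemory) -> dict[int, bool]:
    """Replay turns in order; score only at checkpoints, using the memory built so far."""
    results: dict[int, bool] = {}
    for turn in timeline:
        if turn.rubric is not None:
            results[turn.index] = turn.rubric(memory.respond(turn.user))
        memory.observe(turn.user)
    return results


timeline = [
    Turn(1, "Help me write a technical blog post."),
    Turn(2, "I prefer concise, direct language."),
    Turn(3, "Can you rewrite this to sound more like me?",
         rubric=lambda out: "concise" in out),
    Turn(7, "The renovation budget is around $200K."),
    Turn(8, "Does this fit with the house project numbers we discussed?",
         rubric=lambda out: "$200K" in out),
]
results = replay(timeline, ListMemory())
```

The point of the shape is that each checkpoint is scored against whatever the memory system chose to keep up to that turn, so the same timeline yields a score trajectory rather than a single number.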

For example:

Checkpoint after Interaction 50:
User: Can you rewrite this to sound more like me?

A system with good memory should know from previous interactions that the user prefers concise, direct, slightly dry prose and dislikes over-explaining.

Or:

Checkpoint after Interaction 120:
User: Does this fit with the house project numbers we discussed?

A system with good memory should remember the rough renovation context, but also avoid overconfidence if the numbers may have changed.

Making It More Concrete

The above is just a rough sketch. Many details need to be worked out to make such an eval really useful.

At a high level, we need to build a dataset combining human and synthetic data where each data point looks something like this:

User U

  • hidden profile (only visible to simulated users)
  • interaction history
  • checkpoints (both organic and synthetic)
  • scoring rubrics

For instance:

{
  "user_id": "user_042",
  "hidden_profile": {
    "style": {
      "prefers_concise": true,
      "dislikes_hype": true,
      "likes_deadpan_humor": true
    },
    "projects": {
      "home_renovation": {
        "budget": "$200k",
        "location": "Orange County, CA"
      }
    }
  },
  "timeline": [
    {
      "turn": 1,
      "user": "Can you help me polish this blog post?",
      "type": "normal"
    },
    {
      "turn": 2,
      "user": "Can you make this argument sharper? Less academic, more direct.",
      "type": "normal"
    },
    {
      "turn": 3,
      "user": "Rewrite this document in a similar way.",
      "type": "checkpoint_organic",
      "evaluation": {
        "scoring_rubric": "Output should follow a direct writing style and avoid being overly academic."
      }
    },
    {
      "turn": 4,
      "user": "Help me research steps to get a planning permit for my home renovation project.",
      "type": "normal"
    },
    // ...
    {
      "turn": 200,
      "user": "Write an email to follow up on the renovation cost with the general contractor.", // Generated by the simulated user agent.
      "type": "checkpoint_synthetic",
      "evaluation": {
        "scoring_rubric": "Email should pull in relevant context such as budget, home location, etc."
      }
    }
  ]
}
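One way a harness might consume a data point like this is sketched below in Python. The field names come from the example above; everything else (`TimelineTurn`, `parse_timeline`, `visible_to_system`, `validate`) is a hypothetical helper introduced for illustration, and the sample dict is trimmed to three turns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimelineTurn:
    turn: int
    user: str
    type: str                              # "normal", "checkpoint_organic", "checkpoint_synthetic"
    scoring_rubric: Optional[str] = None   # present only on checkpoints


def parse_timeline(data_point: dict) -> list[TimelineTurn]:
    """Flatten the raw timeline entries into typed turns."""
    turns = []
    for raw in data_point["timeline"]:
        rubric = raw.get("evaluation", {}).get("scoring_rubric")
        turns.append(TimelineTurn(raw["turn"], raw["user"], raw["type"], rubric))
    return turns


def visible_to_system(data_point: dict) -> dict:
    """Drop the hidden profile, which only the simulated user may see."""
    return {k: v for k, v in data_point.items() if k != "hidden_profile"}


def validate(turns: list[TimelineTurn]) -> list[str]:
    """Return a list of problems: out-of-order turns or checkpoints without a rubric."""
    problems = []
    for prev, cur in zip(turns, turns[1:]):
        if cur.turn <= prev.turn:
            problems.append(f"turn {cur.turn} is out of order")
    for t in turns:
        if t.type.startswith("checkpoint_") and not t.scoring_rubric:
            problems.append(f"checkpoint at turn {t.turn} has no rubric")
    return problems


data_point = {
    "user_id": "user_042",
    "hidden_profile": {"style": {"prefers_concise": True}},
    "timeline": [
        {"turn": 1, "user": "Can you help me polish this blog post?", "type": "normal"},
        {"turn": 3, "user": "Rewrite this document in a similar way.",
         "type": "checkpoint_organic",
         "evaluation": {"scoring_rubric": "Output should follow a direct writing style."}},
        {"turn": 200, "user": "Write an email to follow up on the renovation cost.",
         "type": "checkpoint_synthetic",
         "evaluation": {"scoring_rubric": "Email should pull in relevant context."}},
    ],
}
turns = parse_timeline(data_point)
checkpoint_turns = [t.turn for t in turns if t.type.startswith("checkpoint_")]
```

Keeping `visible_to_system` as an explicit step makes it harder to leak the hidden profile into the system under test by accident.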

The eval should be implementation-agnostic but also deeply connected with real-world product value-add and practical constraints. A few key pieces to consider:

User Simulation

To scale the eval data, we need to build simulated user agents that:

  • Produce realistic new interactions based on the user profile and past interactions
  • Model longitudinal behavioral changes due to past interactions and perceived quality/utility of the memory system.
      - E.g. it might avoid chatting about a certain topic if past interactions about this topic had been consistently suboptimal.
  • Produce the ground truth of synthetic new interactions. In a sense, we need to have an Oracle that gives the best response based on all observable data of a given user, likely using a lot more offline compute than what can be leveraged in production.
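The behavioral-drift part of the simulated user can be sketched in a few lines of Python. This is a toy under stated assumptions: `SimulatedUser`, the `avoid_below` threshold, and the `min_samples` window are all hypothetical, and the sketch only picks topics. Generating realistic message text and oracle ground truth would need a model and is left out.

```python
import random
from collections import defaultdict


class SimulatedUser:
    """Toy simulated user: picks topics, but drops ones where memory kept failing."""

    def __init__(self, topics, avoid_below=0.4, min_samples=3, seed=0):
        self.topics = list(topics)
        self.avoid_below = avoid_below     # hypothetical threshold on perceived quality
        self.min_samples = min_samples     # how many recent ratings to average
        self.ratings = defaultdict(list)   # topic -> perceived quality scores in [0, 1]
        self.rng = random.Random(seed)     # seeded so the history is replayable

    def record_feedback(self, topic, quality):
        self.ratings[topic].append(quality)

    def is_avoided(self, topic):
        scores = self.ratings[topic]
        if len(scores) < self.min_samples:
            return False                   # not enough evidence to give up on a topic
        recent = scores[-self.min_samples:]
        return sum(recent) / len(recent) < self.avoid_below

    def next_topic(self):
        candidates = [t for t in self.topics if not self.is_avoided(t)]
        # Fall back to all topics if the user has soured on everything.
        return self.rng.choice(candidates or self.topics)


user = SimulatedUser(["writing", "restaurants", "renovation"])
for quality in (0.2, 0.1, 0.3):            # three consistently poor renovation interactions
    user.record_feedback("renovation", quality)
picked = {user.next_topic() for _ in range(50)}
```

After three poor experiences the user stops raising the renovation topic, which is the kind of feedback loop a static snapshot cannot capture.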

Scoring

The scoring mechanism and rubrics need to consider multiple aspects:

  • It needs to assign weights to successes/failures of a given interaction in a way that aligns with the intended product value.
      - E.g. if a bad memory usage happens early on in the user’s interaction history, should that be punished more or less?
  • It should cover various types of evolutionary pressures, including scenarios like addition, updates, forgetting, conflict resolution, etc.
  • It should balance memory quality against other metrics such as serving latency and compute costs.
      - The system can do whatever it wants. Spending more compute either offline or online will generally lead to better results (e.g. by making a large number of LLM calls to a frontier reasoning model to synthesize and consolidate memories often). But the highest-scoring solution is not always the best one to launch to production.
  • It should support attribution (e.g. identifying which part of the memory system contributed to a loss) so that the eval can actually guide quality hill climbing.
  • ...
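As a rough illustration of how these aspects might fit into one scorer, here is a Python sketch. The linear time weighting, the penalty coefficients, and the `component` labels are all assumptions made for the example, not a proposed metric.

```python
from dataclasses import dataclass


@dataclass
class CheckpointResult:
    turn: int
    quality: float      # rubric score in [0, 1]
    latency_s: float
    cost_usd: float
    pressure: str       # e.g. "addition", "update", "forgetting", "conflict"
    component: str      # hypothetical label for the part of the system blamed for any loss


def aggregate(results, total_turns, early_weight=1.0,
              latency_penalty=0.0, cost_penalty=0.0):
    """Weighted mean of (quality minus latency/cost penalties).

    early_weight > 1 punishes early failures more; < 1 punishes them less.
    Assumes turn >= 1, so weights stay positive for early_weight >= 0.
    """
    weighted, weight_sum = 0.0, 0.0
    for r in results:
        position = r.turn / total_turns                     # near 0 = early, 1 = end
        w = early_weight + (1.0 - early_weight) * position  # linear from early_weight to 1
        net = r.quality - latency_penalty * r.latency_s - cost_penalty * r.cost_usd
        weighted += w * net
        weight_sum += w
    return weighted / weight_sum


def loss_by(results, key):
    """Sum lost quality (1 - quality) grouped by an attribute such as 'component' or 'pressure'."""
    losses = {}
    for r in results:
        k = getattr(r, key)
        losses[k] = losses.get(k, 0.0) + (1.0 - r.quality)
    return losses


results = [
    CheckpointResult(10, 0.5, 1.0, 0.01, "addition", "extraction"),
    CheckpointResult(100, 1.0, 1.0, 0.01, "update", "retrieval"),
]
flat = aggregate(results, total_turns=100)
early_heavy = aggregate(results, total_turns=100, early_weight=2.0)
blame = loss_by(results, "component")
```

Here the early failure at turn 10 drags the score down further when `early_weight` is raised, and `loss_by` shows which component or pressure type accounts for the lost quality. Whether early failures should weigh more or less remains a product decision, as noted above.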

Conclusion

Most AI memory systems are designed top-down based on individual imagination. But this is backwards. We should not start with the system design. Instead, we should start with the survival environment.

An eval like this would quickly expose weaknesses in many approaches. The likely winner is a hybrid of several first-principled engineering components, clever tricks, and holistic trade-offs across multiple variables.

But which system will win is beside the point. The point is that we should not try to design the winner, but let the environment pressure the winner to emerge.


Generative AI Usage Disclosure: Refined by ChatGPT.
