Cognis: Context-Aware Memory for Conversational AI Agents
Summary
Lyzr Cognis introduces a unified, open-source memory system for conversational AI that fuses BM25 and Matryoshka vector search with version-aware ingestion, achieving SOTA on LoCoMo and LongMemEval benchmarks.
Source: [https://arxiv.org/html/2604.19771](https://arxiv.org/html/2604.19771)
Parshva Daftari\*, Khush Patel\*, Shreyas Kapale\*, Jithin George, Siva Surendira
Lyzr Research
{parshva, khush, shreyas, jithin, siva}@lyzr.ai
\*Equal contribution
###### Abstract
LLM agents lack persistent memory, causing conversations to reset each session and preventing personalization over time. We present Lyzr Cognis, a unified memory architecture for conversational AI agents that addresses this limitation through a multi-stage retrieval pipeline. Cognis combines a dual-store backend pairing OpenSearch BM25 keyword matching with Matryoshka vector similarity search, fused via Reciprocal Rank Fusion. Its context-aware ingestion pipeline retrieves existing memories before extraction, enabling intelligent version tracking that preserves full memory history while keeping the store consistent. Temporal boosting enhances time-sensitive queries, and a BGE-2 cross-encoder reranker refines final result quality. We evaluate Cognis on two independent benchmarks, LoCoMo and LongMemEval, across eight answer generation models, demonstrating state-of-the-art performance on both. The system is open-source and deployed in production serving conversational AI applications.
# 1 Introduction
The rapid advancement of Large Language Models \(LLMs\) has enabled the development of sophisticated conversational AI agents capable of complex reasoning and natural language understanding\. However, these agents face a fundamental limitation: they operate within fixed context windows and lack persistent memory capabilities, causing each conversation to begin without knowledge of prior interactions\.
This limitation manifests in several practical problems:
- Conversation discontinuity: Users must re-establish context in every session
- Lost personalization: Agents cannot learn user preferences over time
- Repetitive interactions: Users repeatedly provide the same information
- Shallow relationships: Agents cannot build rapport or trust through continuity
Existing approaches to LLM memory fall into two categories: retrieval-augmented generation (RAG), which treats memory as a document retrieval problem, and specialized memory systems like Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2604.19771#bib.bib1)), Zep (Zep AI, [2024](https://arxiv.org/html/2604.19771#bib.bib2)), and SuperMemory (SuperMemory AI, [2024](https://arxiv.org/html/2604.19771#bib.bib3)), which provide memory-specific abstractions. While these systems represent important progress, they often rely on single retrieval modalities, lack sophisticated temporal reasoning, and do not maintain version history for evolving information.
We present Lyzr Cognis, a unified memory architecture designed to address these limitations. Our key contributions are:
1. Memory Taxonomy: A comprehensive classification system with 15 semantic categories (e.g., personal details, professional, health) and 2 persistence scopes (USER for cross-session, CONTEXT for session-specific).
2. Dual-Store Architecture: A streamlined storage layer combining:
   - OpenSearch for document storage, native BM25 search with configurable text analysis, and version history
   - Vector Database (VDB) for dual-dimension Matryoshka embeddings (768D + 256D) enabling efficient two-stage semantic search
3. Context-Aware Ingestion: An intelligent extraction pipeline that retrieves similar existing memories from the VDB before LLM processing, enabling the model to make informed decisions about whether to ADD new facts, UPDATE existing ones (with version linking), DELETE contradicted information, or skip duplicates entirely.
4. Hybrid Retrieval Pipeline: A sophisticated search system combining vector similarity and BM25 keyword matching through Reciprocal Rank Fusion (RRF) with 70% vector and 30% BM25 weighting, temporal boosting for time-aware queries, content deduplication, and a BGE-2 cross-encoder reranker for final result refinement.
5. Version Tracking: Full version history with is\_current flags and replaces\_id links, enabling historical queries like “What were all my previous jobs?”
6. Cross-Benchmark Validation: Comprehensive evaluation on both LoCoMo and LongMemEval benchmarks across eight answer generation models, demonstrating that architectural advantages generalize across different evaluation frameworks and LLM backends, with up to 92.4% accuracy on LongMemEval’s 500-question benchmark.
[Figure 1 plot: F1 score (axis 0 to 60) by question type for Mem0, Zep, Mem0g, and Cognis. Labeled Cognis gains: Single-Hop +25.7%, Multi-Hop +10.0%, Open-Domain +10.5%, Temporal +21.6%.]
Figure 1: LoCoMo benchmark F1 scores across four question types. Cognis achieves the highest F1 in every category. Percentage labels show Cognis’s gain over the strongest baseline. Mem0 = Mem0 retrieval; Zep = Zep memory framework; Mem0g = Mem0 with graph memory enabled; Cognis = Lyzr Cognis (ours).
[Figure 2 plot: accuracy (%, axis 0 to 100) by question type (SS-User, SS-Asst, SS-Pref, Know. Update, Temp. Reason., Multi-Session) for Zep/Graphiti, SuperMemory, and Cognis. Labeled Cognis gains: SS-User +3.0%, SS-Pref +33.3%, Know. Update +8.7%, Temp. Reason. +20.6%, Multi-Session +22.1%.]
Figure 2: Cross-system accuracy on LongMemEval across six question types. Cognis (orange) leads on five of six categories, with the largest gains on preference recall (+33.3%) and temporal reasoning (+20.6%). SuperMemory leads only on SS-Assistant. Zep/Graphiti = Zep with Graphiti; SuperMemory = SuperMemory graph memory; Cognis = Lyzr Cognis (ours, best across answer models).
The remainder of this paper is organized as follows: Section [2](https://arxiv.org/html/2604.19771#S2) reviews related work, Section [3](https://arxiv.org/html/2604.19771#S3) describes the system architecture, Section [4](https://arxiv.org/html/2604.19771#S4) details the ingestion pipeline, Section [5](https://arxiv.org/html/2604.19771#S5) explains the retrieval pipeline, Section [6](https://arxiv.org/html/2604.19771#S6) describes experimental setup, Section [7](https://arxiv.org/html/2604.19771#S7) presents results on the LoCoMo benchmark, Section [8](https://arxiv.org/html/2604.19771#S8) presents results on LongMemEval, Section [9](https://arxiv.org/html/2604.19771#S9) discusses findings and limitations, and Section [10](https://arxiv.org/html/2604.19771#S10) concludes.
## 2 Related Work
### 2.1 Memory Systems for LLM Agents
The challenge of providing persistent memory to LLM agents has spawned numerous approaches. Early commercial solutions include Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2604.19771#bib.bib1)), which provides a memory layer with automatic fact extraction and vector-based retrieval, and Zep (Zep AI, [2024](https://arxiv.org/html/2604.19771#bib.bib2)), which offers long-term memory with session management and temporal awareness. SuperMemory (SuperMemory AI, [2024](https://arxiv.org/html/2604.19771#bib.bib3)) focuses on knowledge graph integration for multi-hop reasoning capabilities.
Recent academic work has advanced agent memory architectures significantly. MemGPT (Packer et al., [2023](https://arxiv.org/html/2604.19771#bib.bib10)) reconceptualizes memory management through an operating system lens, treating the LLM as a processor with explicit memory hierarchies: main context as RAM and external storage as disk. This approach enables virtual context management beyond fixed window limits but requires complex memory paging operations. Unlike MemGPT’s OS-style paging between context (RAM) and storage (disk), which requires explicit memory management operations to decide what to swap in and out, our dual-store architecture keeps all memories simultaneously accessible through parallel retrieval, sacrificing the theoretical elegance of hierarchical memory for practical simplicity and lower operational complexity. MemoryBank (Zhong et al., [2023](https://arxiv.org/html/2604.19771#bib.bib9)) enhances LLMs with long-term memory through memory consolidation mechanisms inspired by human cognition, storing and updating memories during conversations and employing a memory retrieval mechanism during inference.
ReadAgent (Lee et al., [2024](https://arxiv.org/html/2604.19771#bib.bib8)) takes a human-inspired approach to processing long documents, building “gist memory” that captures essential information at multiple granularities. The agent learns to decide what to remember and what to retrieve, mimicking how humans selectively process and retain information from lengthy texts. While ReadAgent focuses on reading comprehension of long documents, our system addresses a different challenge: maintaining coherent memory across many short conversational exchanges over extended time periods, where the primary difficulty is not document length but temporal span and information evolution. SimpleMem (Liu et al., [2025](https://arxiv.org/html/2604.19771#bib.bib4)) proposes efficient lifelong memory through practical mechanisms for memory organization and retrieval, focusing on computational efficiency without sacrificing effectiveness.
A-MEM (Xu et al., [2025](https://arxiv.org/html/2604.19771#bib.bib5)) introduces agentic memory mechanisms with structured approaches to memory management, allowing agents to autonomously organize and access their memory stores. MemR3 (Du et al., [2025](https://arxiv.org/html/2604.19771#bib.bib6)) combines reflective reasoning with memory retrieval, enabling agents to assess memory relevance through reasoning rather than simple similarity matching, a departure from purely embedding-based approaches. Our context-aware ingestion shares this philosophy of using reasoning over raw similarity, but applies it at write time (deciding how new information relates to existing memories) rather than read time (deciding which memories are relevant to a query). The Hindsight Memory approach (Latimer et al., [2025](https://arxiv.org/html/2604.19771#bib.bib7)) identifies three core capabilities essential for effective agent memory: retention (what to store), recall (how to retrieve), and reflection (how to learn from past experiences) for improved decision-making.
Our work synthesizes insights from these systems while addressing their limitations: (1) unlike MemGPT’s complex paging, we use a simpler dual-store architecture with hybrid retrieval; (2) unlike single-modality systems, we combine multiple retrieval approaches through RRF fusion; (3) we implement Matryoshka embeddings (Kusupati et al., [2022](https://arxiv.org/html/2604.19771#bib.bib16)) for efficient two-stage retrieval; (4) we maintain full version history for evolving information; and (5) we provide comprehensive temporal reasoning often lacking in existing systems.
### 2.2 Retrieval-Augmented Generation
RAG systems (Lewis et al., [2020](https://arxiv.org/html/2604.19771#bib.bib11)) augment LLM responses with retrieved context from external knowledge bases, achieving strong performance on knowledge-intensive tasks without parameter updates. While effective for static document retrieval, standard RAG architectures lack the temporal awareness, version tracking, and memory update mechanisms needed for conversational memory where information evolves over time.
Recent advances have improved RAG capabilities. CLaRa (He et al., [2025](https://arxiv.org/html/2604.19771#bib.bib12)) bridges retrieval and generation through continuous latent reasoning, enabling joint optimization of both processes rather than treating them as separate stages. This unified approach allows the model to reason over retrieved information more effectively. Our hybrid retrieval approach extends RAG principles with BM25 keyword matching for exact term handling, temporal boosting for time-aware queries, and BGE-2 (Chen et al., [2024](https://arxiv.org/html/2604.19771#bib.bib17)) cross-encoder reranking for final result refinement.
### 2.3 Hybrid Search and Dense Retrieval
The complementary strengths of dense (embedding-based) and sparse (keyword-based) retrieval have motivated hybrid approaches (Ma et al., [2021](https://arxiv.org/html/2604.19771#bib.bib14)). Dense retrieval excels at semantic similarity and paraphrase matching, while sparse methods like BM25 (Robertson and Zaragoza, [2009](https://arxiv.org/html/2604.19771#bib.bib22)) capture exact term matches that embeddings may miss. Reciprocal Rank Fusion (RRF) (Cormack et al., [2009](https://arxiv.org/html/2604.19771#bib.bib13)) provides a simple yet effective method for combining ranked lists from multiple retrievers without requiring score calibration.
The accuracy benefits of hybrid search are well\-documented\. Queries containing specific names, dates, or technical terms are often missed by embedding\-based retrieval alone because dense models collapse lexically distinct but semantically similar tokens into nearby vectors\. BM25 anchors on exact tokens and recovers these cases, while vector search handles paraphrased or semantically equivalent queries that keyword matching would miss\. By fusing both ranked lists through RRF, the resulting pipeline achieves higher recall and precision than either modality in isolation, particularly on entity\-heavy and temporal queries where one modality alone consistently underperforms\.
Recent work on dense retrieval optimization includes CADET (Tamber et al., [2025](https://arxiv.org/html/2604.19771#bib.bib15)), which demonstrates that cross-encoder listwise distillation with synthetic data can significantly improve dense retriever performance beyond conventional contrastive learning approaches. This finding informed our decision to include a BGE-2 cross-encoder reranker as a final refinement stage. Adding a cross-encoder after initial bi-encoder retrieval provides a second pass of fine-grained relevance scoring, catching subtle mismatches that bi-encoder dot-product similarity overlooks and yielding measurable accuracy gains on multi-hop and open-domain queries. We extend these techniques with explicit temporal relevance scoring to address a gap in existing hybrid search systems.
### 2.4 Embedding Representations
Modern embedding models have achieved strong performance on semantic similarity tasks. Matryoshka Representation Learning (Kusupati et al., [2022](https://arxiv.org/html/2604.19771#bib.bib16)) introduced the concept of training embeddings that remain effective when truncated to lower dimensions, enabling adaptive compute-accuracy tradeoffs. We leverage this property for two-stage retrieval: fast shortlisting with truncated 256D embeddings followed by accurate ranking with full 768D embeddings.
The BGE M3-Embedding (Chen et al., [2024](https://arxiv.org/html/2604.19771#bib.bib17)) family provides multi-lingual, multi-functionality embeddings through self-knowledge distillation. The associated BGE reranker models serve as cross-encoders that jointly encode query-document pairs, providing more accurate relevance judgments than bi-encoder similarity at the cost of increased computation. We use the BGE-2 reranker as our final refinement stage.
### 2.5 Query Understanding and Attention
Effective memory retrieval depends on understanding user intent. System 2 Attention (Weston and Sukhbaatar, [2023](https://arxiv.org/html/2604.19771#bib.bib18)) addresses attention limitations in Transformers by using LLM reasoning to filter irrelevant context before processing, improving focus on pertinent information. Rephrase and Respond (Deng et al., [2023](https://arxiv.org/html/2604.19771#bib.bib19)) shows that having LLMs reformulate questions before answering improves performance across diverse tasks by clarifying ambiguous queries. These insights inform our query analysis pipeline, which detects temporal intent (triggering time-based boosting) and history keywords (enabling version chain traversal).
### 2.6 Cognitive Science Foundations
Our memory taxonomy draws on cognitive science research distinguishing memory types. Tulving’s foundational work (Tulving, [1972](https://arxiv.org/html/2604.19771#bib.bib23)) established the distinction between episodic memory (autobiographical events) and semantic memory (factual knowledge), which informs our decay rate assignments. The Atkinson-Shiffrin model (Atkinson and Shiffrin, [1968](https://arxiv.org/html/2604.19771#bib.bib24)) of human memory with its multi-store architecture (sensory, short-term, long-term) inspired our separation of immediate recall (raw messages) from consolidated memories (extracted facts).
### 2.7 Benchmarks for Long-Term Memory
Evaluating long-term memory systems requires specialized benchmarks that test persistence and temporal reasoning. LongMemEval (Wu et al., [2025](https://arxiv.org/html/2604.19771#bib.bib20)) provides a comprehensive benchmark evaluating five critical capabilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention (correctly declining to answer when information is unavailable). The benchmark reveals a concerning 30% accuracy drop in commercial systems during sustained interactions, highlighting the difficulty of maintaining coherent long-term memory.
The LoCoMo benchmark (Maharana et al., [2024](https://arxiv.org/html/2604.19771#bib.bib21)) specifically tests memory systems across multi-session conversations requiring recall across 50+ sessions, with distinct question categories (single-hop, multi-hop, open-domain, temporal) that expose different failure modes. We adopt LongMemEval’s evaluation methodology with question type-specific judge prompts while evaluating on LoCoMo’s challenging multi-session scenarios.
## 3 System Architecture
### 3.1 Overview
The Lyzr Cognis system follows a streamlined architecture consisting of an orchestration engine \(the Unified Memory Provider\) and a dual\-store storage backend\. This design prioritizes simplicity and efficiency while maintaining powerful hybrid retrieval capabilities\.
The architecture deliberately avoids unnecessary complexity. While graph databases and caching layers can provide benefits in specific scenarios, our empirical evaluation showed that the combination of OpenSearch’s native BM25 search with dual-dimension vector search provides excellent retrieval quality with simpler operational requirements. Our ablation studies (Section [7.3](https://arxiv.org/html/2604.19771#S7.SS3)) demonstrate that OpenSearch’s native BM25 implementation significantly outperforms MongoDB’s text indexing, particularly on open-domain queries requiring broad entity matching.
### 3.2 Memory Taxonomy
We introduce a comprehensive taxonomy for classifying memories along three dimensions, drawing on cognitive science research into human memory organization (Tulving, [1972](https://arxiv.org/html/2604.19771#bib.bib23); Atkinson and Shiffrin, [1968](https://arxiv.org/html/2604.19771#bib.bib24)).
### 3.3 Dual-Store Architecture
Our architecture employs two complementary storage systems, each optimized for different access patterns (Table [1](https://arxiv.org/html/2604.19771#S3.T1)).
Table 1: Dual-Store Data Distribution
#### 3.3.1 OpenSearch Indexes
OpenSearch serves as the document store with two primary indexes. The messages index stores raw conversation messages with timestamps, speaker information, and processing status flags. The memories index contains extracted facts with full metadata including category, scope, version tracking fields (is\_current, replaces\_id), and event timestamps for temporal reasoning.
OpenSearch provides native BM25 search with configurable text analyzers, offering superior scoring control compared to MongoDB’s text indexing. Our ablation experiments show that this switch from MongoDB to OpenSearch yields the single largest performance improvement, particularly a +20.3% gain on open-domain LLM Judge scores (Section [7.3](https://arxiv.org/html/2604.19771#S7.SS3)). The native BM25 implementation with proper tokenization and term-frequency weighting excels at matching specific entity names, dates, and technical terms that pure semantic search may miss.
#### 3.3.2 VDB Collections
The vector database maintains three collections for efficient retrieval:
Table 2: VDB Collection Structure
[Figure 3 diagram: OpenSearch (memories and messages indexes; native BM25, version tracking) alongside the Vector DB (\*\_768d accurate, \*\_256d fast, and immediate\_recall collections; Matryoshka embeddings).]
Figure 3: Dual-store architecture: OpenSearch (documents + native BM25) and VDB (Matryoshka embeddings at 768D/256D).
This dual-store approach provides the best of both worlds: OpenSearch’s native BM25 search with configurable text analysis combined with the VDB’s high-performance vector similarity search. The separation also allows independent scaling of document and vector workloads.
## 4 Ingestion Pipeline
The ingestion pipeline transforms raw conversation messages into structured, searchable memories. A key innovation is that the LLM extraction step receives context from similar existing memories, enabling intelligent decisions about how to handle new information relative to what the system already knows (Figure [4](https://arxiv.org/html/2604.19771#S4.F4)).
[Figure 4 diagram: two panels. Immediate Storage: raw messages are written to OpenSearch and the VDB. Context-Aware Extraction: embed, retrieve the top-10 similar memories as context, LLM chooses ADD/UPDATE, version tracking, then write to OpenSearch and the VDB.]
Figure 4: Two-panel ingestion pipeline. Left: Immediate storage enables recall before extraction. Right: Context-aware extraction retrieves similar memories, LLM decides ADD/UPDATE/DELETE, version tracking maintains history.
### 4.1 Message Storage and Immediate Recall
When messages arrive via the API, they undergo two immediate storage operations. First, OpenSearch stores messages in the messages index with metadata including timestamp, speaker, session ID, and a processed: false flag. Simultaneously, a 256D embedding is generated and stored in the VDB’s immediate\_recall\_256d collection, enabling instant retrieval of recent conversation context. This dual storage ensures that even before fact extraction completes, the system can retrieve relevant conversation history for queries requiring immediate recall.
### 4.2 Speaker Identification
Messages follow a structured format that enables speaker identification:
[2024-05-08 10:30:00] James: I just got a new job at Google!
[2024-05-08 10:30:15] Assistant: Congratulations! That’s exciting news.
The extraction system parses this format to extract the speaker name (“James”) for proper attribution, calculate absolute dates from relative references (“yesterday” → a specific date based on the message timestamp), and associate facts with the correct user in multi-party conversations.
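The parsing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the Cognis implementation: the regular expression is inferred from the two example lines, and `parse_message`, `resolve_relative_date`, and the `RELATIVE_OFFSETS` table are hypothetical names introduced here.

```python
import re
from datetime import datetime, timedelta

# Assumed line format, inferred from the example messages:
# [YYYY-MM-DD HH:MM:SS] Speaker: text
MESSAGE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s*([^:]+):\s*(.*)$"
)

# Hypothetical table of relative phrases and their day offsets (illustrative only).
RELATIVE_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def parse_message(line):
    """Split one formatted line into timestamp, speaker, and text."""
    match = MESSAGE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized message format: {line!r}")
    stamp, speaker, text = match.groups()
    return {
        "timestamp": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        "speaker": speaker.strip(),
        "text": text,
    }


def resolve_relative_date(phrase, message_time):
    """Turn a relative phrase into an absolute date anchored at the message time."""
    offset = RELATIVE_OFFSETS[phrase.lower()]
    return (message_time + timedelta(days=offset)).date()


msg = parse_message("[2024-05-08 10:30:00] James: I just got a new job at Google!")
event_date = resolve_relative_date("yesterday", msg["timestamp"])
print(msg["speaker"], event_date.isoformat())
```

Anchoring relative phrases to the message timestamp rather than to the processing time keeps extracted dates stable when ingestion runs later than the conversation.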
### 4.3 Context-Aware LLM Extraction
The core innovation of our ingestion pipeline is that fact extraction is not performed in isolation. Before the LLM processes new messages, the system generates an embedding for the new message content, searches the VDB for the top-10 most similar existing memories, and provides both the new messages and existing memories to the LLM. Figure [5](https://arxiv.org/html/2604.19771#S4.F5) illustrates why this context-aware approach matters. Without retrieving existing memories, the LLM treats each extraction independently, leading to conflicting or duplicate entries. With context retrieval, the system maintains a consistent, evolving knowledge base.
[Figure 5 diagram: two panels. Without Context Retrieval: new message “Promoted to Sr. Engineer”; existing memory “James is Engineer” is ignored; the LLM (no context) issues ADD, leaving two conflicting memories, “James is Sr. Engineer” and “James is Engineer”. With Context Retrieval (Cognis): new message “Promoted to Sr. Engineer”; retrieved memory “James is Engineer” (id=12); the LLM (with context) issues UPDATE with replaces\_id=12, leaving a single consistent memory, “James is Sr. Engineer” (replaces #12), plus history.]
Figure 5: Context-aware extraction comparison. Left: Without context, the LLM creates conflicting memories (existing memory ignored). Right: With context retrieval, the LLM issues UPDATE operations maintaining consistency.
This context enables the LLM to make intelligent decisions about how to handle new information relative to what already exists in the memory store. Table [3](https://arxiv.org/html/2604.19771#S4.T3) summarizes the four possible operations: ADD for genuinely new facts, UPDATE for information that supersedes existing memories (with version linking), DELETE for explicitly contradicted information, and NONE for duplicates or trivial content not worth storing.
Table 3: LLM Operation Decision Logic
For example, when processing the message “I got promoted to Senior Engineer,” the LLM returns structured JSON with the appropriate operations:
{
  "operations": [
    {
      "action": "UPDATE",
      "fact": "James works at Google as a Senior Engineer",
      "replaces_id": 42,
      "category": "professional",
      "event_date": "2024-05-08"
    },
    {
      "action": "ADD",
      "fact": "James is excited about his new role",
      "category": "emotional"
    }
  ]
}
This approach prevents duplicate memories, maintains consistency when information changes, and ensures the memory store reflects the most current understanding of the user’s world\.
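To make the operation handling concrete, the following Python sketch applies such a response to a toy in-memory store. It is an illustration under stated assumptions, not the system's code: the dict-based store, the `apply_operations` helper, the integer ID counter, and the `target_id` field used for DELETE are all hypothetical, and DELETE is modeled here as marking a memory non-current.

```python
import json


def apply_operations(store, payload, next_id):
    """Apply ADD / UPDATE / DELETE / NONE operations to a dict keyed by memory ID.

    Returns the next unused ID. The store layout is an assumption for illustration.
    """
    for op in json.loads(payload)["operations"]:
        action = op["action"]
        if action == "ADD":
            store[next_id] = {"fact": op["fact"], "category": op.get("category"),
                              "version": 1, "replaces_id": None, "is_current": True}
            next_id += 1
        elif action == "UPDATE":
            old = store[op["replaces_id"]]
            old["is_current"] = False  # the superseded memory is kept as history
            store[next_id] = {"fact": op["fact"], "category": op.get("category"),
                              "version": old["version"] + 1,
                              "replaces_id": op["replaces_id"], "is_current": True}
            next_id += 1
        elif action == "DELETE":
            # Assumption: the target is named by a hypothetical "target_id" field.
            store[op["target_id"]]["is_current"] = False
        elif action == "NONE":
            continue  # duplicate or trivial content, nothing to store
        else:
            raise ValueError(f"unknown action: {action!r}")
    return next_id


store = {42: {"fact": "James works at Google as an Engineer", "category": "professional",
              "version": 1, "replaces_id": None, "is_current": True}}
payload = """{"operations": [
  {"action": "UPDATE", "fact": "James works at Google as a Senior Engineer",
   "replaces_id": 42, "category": "professional", "event_date": "2024-05-08"},
  {"action": "ADD", "fact": "James is excited about his new role", "category": "emotional"}
]}"""
next_id = apply_operations(store, payload, next_id=43)
```

After the call, memory 42 is historical, memory 43 is version 2 and points back to 42, and memory 44 holds the newly added fact.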
### 4.4 Matryoshka Embedding Generation
For extracted memories, we generate dual-dimension embeddings using the Matryoshka approach (Kusupati et al., [2022](https://arxiv.org/html/2604.19771#bib.bib16)). The system first generates a full 768D embedding with the prefix “search\_document:”, then truncates to 256D ($\mathbf{e}_{256} = \mathbf{e}_{768}[0{:}256]$), and finally re-normalizes the result ($\mathbf{e}_{256} = \frac{\mathbf{e}_{256}}{\|\mathbf{e}_{256}\|_2}$). This approach, pioneered by Matryoshka Representation Learning, ensures that the truncated embedding preserves semantic meaning while enabling faster approximate search. The 256D embedding captures the coarse-grained semantics sufficient for shortlisting, while the full 768D embedding provides fine-grained discrimination for accurate ranking.
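The truncate-then-renormalize step is small enough to show directly. The sketch below uses plain Python lists and a synthetic stand-in vector, since the embedding model itself is outside the scope of this example; `truncate_and_normalize` is a hypothetical helper name.

```python
import math


def truncate_and_normalize(embedding, dim=256):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Stand-in for a 768D model output; a real embedding would come from the encoder.
full = [math.sin(i + 1) for i in range(768)]
short = truncate_and_normalize(full)
print(len(short), round(sum(x * x for x in short), 6))
```

Re-normalizing matters because a prefix of a unit vector generally has norm below one, which would distort cosine or dot-product scores in the 256D collection.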
### 4.5 Dual-Store Persistence
Each extracted memory is stored across both backends. OpenSearch stores a document containing full text content and metadata (category, scope), version tracking fields (is\_current, replaces\_id, version), temporal information (event\_time, created\_at), and BM25-indexed text fields with configurable analyzers. The VDB stores two vector points with payloads: the 768D collection holds full embeddings for accurate retrieval, while the 256D collection contains truncated embeddings for fast shortlisting. Both collections include memory\_id, user\_id, is\_current, and event\_time in their payloads.
### 4.6 Version Tracking and History
Lyzr Cognis implements git-like automatic versioning that requires no explicit user action. Every UPDATE operation automatically creates a new version while preserving full history, enabling time-travel capabilities for debugging and audit trails.
When the LLM returns an UPDATE operation, the system preserves history through version chaining. First, it marks the old memory by setting is\_current=false and status=“historical” in both OpenSearch and VDB. Then it creates a new memory with replaces\_id pointing to the old memory’s ID, and increments the version number (version = old\_version + 1).
[Figure 6 diagram: a timeline of three versions linked by replaces\_id. v1 (Jan 2023): “James works as Software Engineer”, id 101, version 1, replaces\_id null, is\_current false (historical). v2 (Aug 2023): “James works as Senior Engineer”, id 142, version 2, replaces\_id 101, is\_current false (historical). v3 (Feb 2024): “James works as Tech Lead”, id 187, version 3, replaces\_id 142, is\_current true (current). The query “What were all my previous jobs?” returns v1, v2, and v3 (full history).]
Figure 6: Version chaining for memory history. Each UPDATE creates a new version linked via replaces\_id. Historical versions have is\_current=false, enabling time-travel queries that traverse the chain to retrieve complete evolution of facts.
This approach provides several benefits: automatic versioning where updates or deletes create new versions without explicit user action; a full audit trail showing how information evolved over time; time-travel queries that retrieve any previous state (e.g., “What were all my previous jobs?”); and rollback capability to debug what changed and when by traversing version chains.
This chain enables powerful historical queries\. When a user asks “What were all my previous jobs?”, the retrieval pipeline can traverse the version chain to return the complete employment history, not just the current position\.
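Chain traversal of this kind can be sketched as a walk along replaces\_id links. The example below is a hedged illustration using the IDs and facts from Figure 6; the in-memory dict and the `version_history` helper are assumptions, and a production system would presumably issue lookups against its document store instead.

```python
def version_history(memories, current_id):
    """Follow replaces_id links from the current memory back to the first version.

    Returns the chain ordered oldest to newest.
    """
    chain = []
    seen = set()
    memory_id = current_id
    while memory_id is not None:
        if memory_id in seen:
            raise ValueError("cycle detected in version chain")
        seen.add(memory_id)
        memory = memories[memory_id]
        chain.append(memory)
        memory_id = memory["replaces_id"]
    chain.reverse()
    return chain


memories = {
    101: {"id": 101, "fact": "James works as Software Engineer",
          "version": 1, "replaces_id": None, "is_current": False},
    142: {"id": 142, "fact": "James works as Senior Engineer",
          "version": 2, "replaces_id": 101, "is_current": False},
    187: {"id": 187, "fact": "James works as Tech Lead",
          "version": 3, "replaces_id": 142, "is_current": True},
}
history = version_history(memories, 187)
for entry in history:
    print(entry["version"], entry["fact"])
```

The cycle check is a defensive addition for the sketch; a well-formed chain never revisits an ID.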
## 5 Retrieval Pipeline
The retrieval pipeline implements a sophisticated hybrid search combining vector similarity and keyword matching, with temporal boosting and a BGE-2 cross-encoder reranker for final refinement (Figure [7](https://arxiv.org/html/2604.19771#S5.F7)).
[Figure 7 diagram: five stages. Analyze: query analysis. Search: vector and BM25. Fuse: RRF (70 + 30). Refine: temporal boost, dedup, BGE-2. Output: results.]
Figure 7: Retrieval pipeline: query analysis → parallel vector/BM25 search → RRF fusion (70% + 30%) → temporal boost, dedup → BGE-2 rerank → results.
### 5.1 Query Analysis
Before search execution, we analyze the query to determine retrieval strategy:
Temporal Intent Detection: Keywords like “when”, “yesterday”, “last week”, “on May 8th” trigger temporal boosting\. The system extracts the time reference and calculates appropriate time windows for relevance scoring\.
History Detection: Keywords like “previous”, “all my”, “history”, “journey”, “over time” indicate historical queries that should include superseded memories \(is\_current=false\), enabling retrieval of complete version chains\.
This analysis shapes both the search filters and post\-processing stages\.
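A keyword-driven analyzer along these lines might look as follows. This is a rough sketch: the pattern lists only cover the example keywords quoted above, the date pattern is a simplification, and `analyze_query` with its two output flags is a hypothetical interface.

```python
import re

# Patterns built from the example keywords in the text; a real list would be longer.
TEMPORAL_PATTERNS = [
    r"\bwhen\b",
    r"\byesterday\b",
    r"\blast week\b",
    r"\bon (?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]* \d{1,2}(?:st|nd|rd|th)?\b",
]
HISTORY_PATTERNS = [
    r"\bprevious\b",
    r"\ball my\b",
    r"\bhistory\b",
    r"\bjourney\b",
    r"\bover time\b",
]


def analyze_query(query):
    """Flag temporal intent and history intent from surface keywords."""
    text = query.lower()
    return {
        "temporal_boost": any(re.search(p, text) for p in TEMPORAL_PATTERNS),
        "include_history": any(re.search(p, text) for p in HISTORY_PATTERNS),
    }


print(analyze_query("What were all my previous jobs?"))
print(analyze_query("What happened on May 8th?"))
```

In this sketch the history flag would relax the is\_current filter during search, while the temporal flag would enable the boosting stage later in the pipeline.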
### 5.2 Matryoshka Two-Stage Vector Search
We implement efficient two-stage retrieval using Matryoshka embeddings, trading off speed and accuracy. Stage 1 (Fast Shortlisting) searches the 256D collection for 200 candidate memories with ~5-10ms latency, using coarse-grained semantics for rapid filtering. Stage 2 (Accurate Re-ranking) filters the 768D collection by the shortlist memory IDs and computes precise similarity with full embeddings, achieving high-precision semantic discrimination in ~10-20ms.
[Figure 8 diagram: Matryoshka two-stage retrieval. The 768D query embedding is truncated to 256D. Stage 1 (fast): 256D collection, 200 candidates, ~5-10ms, coarse-grained semantics. Stage 2 (accurate): 768D collection filtered by shortlist IDs, top N results, ~10-20ms, fine-grained discrimination. Output: final results.]
Figure 8: Matryoshka two-stage retrieval: truncated 256D embeddings enable fast shortlisting (200 candidates in ~5ms), followed by accurate 768D re-ranking (~15ms) for final results.
Two-stage Matryoshka retrieval provides substantial latency improvements with minimal accuracy impact. In our benchmarks, p50 latency drops from 32ms (single-stage 768D) to 18ms (256D shortlist + 768D filter), a 44% reduction. The p99 latency improves from 89ms to 51ms, a 43% reduction that matters for tail latency in production systems. The accuracy cost is minimal: in our testing, temporal F1 differs by only 1.4% between two-stage retrieval and single-stage 768D search, indicating that the 256D coarse filtering preserves the candidates that matter for final ranking.
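The two stages can be illustrated with a brute-force toy version in pure Python. This sketch makes several simplifying assumptions: vectors are 4-dimensional with a 2-dimensional prefix standing in for 768D and 256D, the short vectors are sliced on the fly rather than read from a separate collection, and exhaustive scoring replaces the approximate index a vector database would use. The function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query_full, index_full, short_dim, shortlist_size, top_n):
    """Shortlist with truncated vectors, then re-rank the shortlist with full vectors."""
    # Stage 1: coarse scoring on the first `short_dim` components.
    query_short = normalize(query_full[:short_dim])
    coarse = sorted(
        index_full,
        key=lambda mid: dot(query_short, normalize(index_full[mid][:short_dim])),
        reverse=True,
    )
    shortlist = coarse[:shortlist_size]
    # Stage 2: precise scoring restricted to the shortlisted IDs.
    query_norm = normalize(query_full)
    fine = sorted(
        shortlist,
        key=lambda mid: dot(query_norm, normalize(index_full[mid])),
        reverse=True,
    )
    return fine[:top_n]


index = {
    "a": [1.0, 0.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0, 0.9],
    "c": [0.0, 1.0, 0.0, 0.0],
}
result = two_stage_search([1.0, 0.0, 0.0, 0.2], index,
                          short_dim=2, shortlist_size=2, top_n=2)
print(result)
```

Memory "c" is discarded in stage 1 and never scored at full dimension, which is where the latency saving comes from when the shortlist is much smaller than the collection.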
### 5.3 BM25 Text Search
OpenSearch provides native BM25 search that complements semantic similarity. BM25 excels at exact name matching (“James”, “Google”), technical terms and acronyms, dates and numbers, and rare words with high discriminative power. Our ablation studies (Section [7.3](https://arxiv.org/html/2604.19771#S7.SS3)) demonstrate that OpenSearch’s native BM25 implementation with configurable text analyzers significantly outperforms MongoDB’s text indexing, yielding a +20.3% improvement on open-domain LLM Judge scores. The query is executed against OpenSearch’s BM25 index with user isolation:

    {
      "query": {
        "bool": {
          "must": {"match": {"content": query}},
          "filter": [
            {"term": {"owner_id": user_id}},
            {"term": {"is_current": true}}
          ]
        }
      }
    }

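As an illustration, the request body above could be assembled in Python as follows\. The helper build\_bm25\_query and its current\_only flag are hypothetical names introduced here; the field names content, owner\_id, and is\_current come from the query above\.

```python
import json


def build_bm25_query(query_text: str, user_id: str, current_only: bool = True) -> dict:
    """Build the bool query as a plain dict: BM25 match plus non-scoring filters."""
    # Filters restrict results to one user (isolation) without affecting the score.
    filters = [{"term": {"owner_id": user_id}}]
    if current_only:
        # Hide superseded versions unless the caller asks for history.
        filters.append({"term": {"is_current": True}})
    return {
        "query": {
            "bool": {
                "must": {"match": {"content": query_text}},
                "filter": filters,
            }
        }
    }


body = build_bm25_query("What did Sarah say?", user_id="user-123")
print(json.dumps(body, indent=2))
```

Setting current\_only=False drops the is\_current filter, which is the adjustment that historical queries need\.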
### 5\.4Reciprocal Rank Fusion
We combine results from vector and BM25 search using Reciprocal Rank Fusion \(RRF\) \(Cormack et al\., [2009](https://arxiv.org/html/2604.19771#bib.bib13)\):
$$\text{RRF}(d)=\sum_{r\in R}\frac{1}{k+\text{rank}_{r}(d)} \tag{1}$$
where $k=10$ is a smoothing constant and $\text{rank}_{r}(d)$ is the rank of document $d$ in retriever $r$\.
The final fused score combines weighted contributions:
$$\text{score}_{\text{fused}}=0.70\cdot\text{RRF}_{\text{vector}}+0.30\cdot\text{RRF}_{\text{BM25}} \tag{2}$$
The 70/30 vector\-BM25 weighting reflects an empirical observation about query types in conversational memory\. The majority of queries seek semantic similarity \(“Tell me about my hobbies”, “What do I like to do?”\) where vector search excels at matching paraphrased concepts\. However, approximately 30% of queries contain specific anchors—names \(“What did Sarah say?”\), dates \(“meeting on March 15th”\), technical terms \(“my AWS credentials”\)—where BM25’s exact matching provides critical signal that embeddings may dilute\.
We evaluated alternative weightings during development: equal weighting \(50/50\) over\-weighted keyword matches for paraphrase queries, causing irrelevant memories with incidental term overlap to outrank semantically relevant ones\. Conversely, 80/20 weighting missed important exact matches when users queried specific entities by name\. The 70/30 balance consistently outperformed alternatives across all four question types in our validation set, achieving the best trade\-off between semantic coverage and lexical precision\.
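Equations 1 and 2 can be sketched as a single function\. This is an illustrative sketch under two assumptions not stated in the text: ranks are 1\-based, and a memory missing from one retriever list contributes nothing from that retriever\. The function name weighted\_rrf is ours\.

```python
def weighted_rrf(vector_ranking, bm25_ranking, k=10, w_vector=0.70, w_bm25=0.30):
    """Fuse two ranked lists of memory IDs (best first) into one scored list."""
    scores = {}
    for weight, ranking in ((w_vector, vector_ranking), (w_bm25, bm25_ranking)):
        for rank, memory_id in enumerate(ranking, start=1):
            # Each retriever contributes weight / (k + rank); absent IDs add nothing.
            scores[memory_id] = scores.get(memory_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = weighted_rrf(["m1", "m2", "m3"], ["m3", "m1", "m4"])
for memory_id, score in fused:
    print(memory_id, round(score, 4))
```

For the two toy lists above, m1 ranks first because it sits near the top of both lists, and m3 overtakes m2 because its BM25 rank of 1 outweighs its lower vector rank\.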
### 5\.5Temporal Boosting
For queries with temporal intent, we apply time\-based relevance scoring\. The temporal score measures how close a memory’s event time is to the query’s temporal reference:
$$\text{temporal\_score}=\max\left(0.1,\;1-\frac{|\text{event\_time}-\text{query\_date}|}{\text{window\_days}}\right) \tag{3}$$
The final score integrates temporal relevance:
$$\text{score}_{\text{final}}=0.60\cdot\text{score}_{\text{fused}}+0.40\cdot\text{temporal\_score} \tag{4}$$
This boosting is crucial for questions like “What did I do last Tuesday?” where temporal proximity should outweigh pure semantic similarity\.
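A small sketch of Equations 3 and 4 follows\. The default window\_days=30 is an assumption made only for illustration, since no window size is given here, and the sketch takes the fused score as already being on a scale comparable to the temporal score\.

```python
from datetime import date


def temporal_score(event_time: date, query_date: date, window_days: int = 30) -> float:
    """Equation 3: linear decay with distance in days, floored at 0.1."""
    distance_days = abs((event_time - query_date).days)
    return max(0.1, 1.0 - distance_days / window_days)


def final_score(fused_score: float, t_score: float) -> float:
    """Equation 4: 60% fused retrieval score, 40% temporal score."""
    return 0.60 * fused_score + 0.40 * t_score


# A memory dated three days before the query's reference date.
near = temporal_score(date(2024, 3, 12), date(2024, 3, 15))
# A memory far outside the window falls back to the 0.1 floor.
far = temporal_score(date(2023, 1, 1), date(2024, 3, 15))
print(near, far, final_score(0.5, near))
```

The floor of 0\.1 means that memories outside the window are down\-weighted rather than eliminated, so a strong fused score can still surface them\.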
### 5\.6Content Deduplication
Post\-fusion, we remove near\-duplicate memories to ensure diversity in results\. Two memories are considered duplicates if their semantic similarity score exceeds 99%\. When duplicates are found, only the highest\-scoring version is retained\.
This is particularly important when version chains exist—without deduplication, both current and historical versions of the same fact might appear in results\.
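One way to realize this step is a greedy pass over the results in descending score order\. The sketch below assumes that semantic similarity means cosine similarity between memory embeddings and that each result carries id, score, and embedding fields; both are assumptions for illustration\.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(results, threshold=0.99):
    """Keep the highest-scoring member of each near-duplicate group.

    Results are visited from highest to lowest score; a candidate is kept only
    if its similarity to every already-kept result is at or below threshold.
    """
    kept = []
    for candidate in sorted(results, key=lambda r: r["score"], reverse=True):
        if all(cosine(candidate["embedding"], k["embedding"]) <= threshold for k in kept):
            kept.append(candidate)
    return kept


results = [
    {"id": "old", "score": 0.4, "embedding": [1.0, 0.0]},
    {"id": "new", "score": 0.9, "embedding": [1.0, 0.001]},
    {"id": "other", "score": 0.5, "embedding": [0.0, 1.0]},
]
unique = deduplicate(results)
```

Here the lower\-scoring "old" entry is dropped because it is nearly identical to "new", while "other" survives because it is dissimilar to everything kept so far\.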
### 5\.7BGE\-2 Cross\-Encoder Reranking
As a final refinement stage, we apply a cross\-encoder reranker to the top candidates\. The BGE\-2 reranker \(Chen et al\., [2024](https://arxiv.org/html/2604.19771#bib.bib17)\) jointly encodes the query and each candidate document, providing more accurate relevance scores than bi\-encoder similarity alone\.
The reranker operates via a remote HTTP endpoint, sending the query and top\-N candidates \(post\-deduplication\) to the reranker service, receiving refined relevance scores for each candidate, and re\-sorting results by these scores\. This stage adds approximately 20\-50ms latency but provides significant quality improvements, particularly for nuanced queries where initial retrieval may not perfectly order results\.
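The re\-sorting step can be illustrated without committing to any particular service API\. In the sketch below, score\_fn is a hypothetical stand\-in for the remote reranker call, and overlap\_scorer is a toy scorer used only so the example runs; neither reflects the actual BGE\-2 endpoint\.

```python
def rerank(query, candidates, score_fn, top_n=20):
    """Re-score the first top_n candidates with score_fn and re-sort them.

    candidates: list of dicts with a 'content' field, already ordered by the
    earlier pipeline stages. score_fn(query, texts) returns one float per text.
    Candidates beyond top_n keep their original order after the reranked head.
    """
    head, tail = candidates[:top_n], candidates[top_n:]
    scores = score_fn(query, [c["content"] for c in head])
    ordered = sorted(zip(head, scores), key=lambda pair: pair[1], reverse=True)
    return [dict(c, rerank_score=s) for c, s in ordered] + tail


def overlap_scorer(query, texts):
    """Toy scorer: number of distinct query words that appear in each text."""
    q = set(query.lower().split())
    return [float(len(q & set(t.lower().split()))) for t in texts]


candidates = [
    {"content": "James enjoys oatmeal"},
    {"content": "James had eggs on Tuesday"},
]
reranked = rerank("what did james eat on tuesday", candidates, overlap_scorer)
```

Keeping the scorer behind a plain callable makes it straightforward to swap a remote cross\-encoder for a local model or a test double\.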
### 5\.8Historical Query Handling
When include\_historical=True \(detected via query analysis\), the pipeline adjusts its behavior by removing the is\_current=true filter from both vector and BM25 searches, including both current and superseded memories in results, traversing version chains via replaces\_id links, and sorting results chronologically \(oldest first\) to show evolution\. This enables queries like “What were all my previous jobs?” to return the complete employment history with version relationships\.
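Version\-chain traversal might look like the following sketch\. It assumes that each record stores the ID of the version it supersedes in replaces\_id \(None for the first version\) and a sortable created\_at value; the record layout and the function name version\_chain are assumptions\.

```python
def version_chain(memories, current_id):
    """Follow replaces_id links from the current memory back to the original.

    memories: dict of memory_id -> record with 'content', 'replaces_id'
    and 'created_at'. Returns the chain sorted oldest first.
    """
    chain, seen = [], set()
    memory_id = current_id
    while memory_id is not None and memory_id not in seen:
        seen.add(memory_id)  # guard against accidental cycles in the links
        record = memories[memory_id]
        chain.append(record)
        memory_id = record["replaces_id"]
    return sorted(chain, key=lambda r: r["created_at"])


jobs = {
    "v1": {"content": "Works at Acme", "replaces_id": None, "created_at": 1},
    "v2": {"content": "Works at Globex", "replaces_id": "v1", "created_at": 2},
    "v3": {"content": "Works at Initech", "replaces_id": "v2", "created_at": 3},
}
history = [r["content"] for r in version_chain(jobs, "v3")]
```

Starting from the current version and walking backwards means only one lookup per version is needed, and the final sort yields the oldest\-first ordering described above\.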
## 6Experimental Setup
### 6\.1Datasets
We evaluate Cognis on two complementary benchmarks that test different aspects of long\-term memory\.
#### 6\.1\.1LoCoMo
The LoCoMo \(Long\-Context Modeling\) benchmark \(Maharana et al\., [2024](https://arxiv.org/html/2604.19771#bib.bib21)\) tests memory systems across multi\-session conversations requiring recall across 50\+ sessions\. LoCoMo provides four distinct question categories that test different memory capabilities: Single\-Hop questions require direct fact recall from a single session \(e\.g\., “What is James’s favorite sport?”\); Multi\-Hop questions involve reasoning across multiple facts \(e\.g\., “What indoor activity would Alice enjoy with her dog?”\); Open Domain questions are broad and require comprehensive recall \(e\.g\., “Tell me about James’s hobbies”\); and Temporal questions are time\-sensitive \(e\.g\., “What did James do last Tuesday?”\)\. This categorization enables fine\-grained analysis of where memory systems succeed or struggle\.
#### 6\.1\.2LongMemEval
The LongMemEval benchmark \(Wu et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib20)\) is a comprehensive test of chat assistant long\-term memory comprising 500 questions spanning six question types that systematically test different memory capabilities\. SS\-User questions test recall of single\-session user\-stated facts \(e\.g\., “What city did I say I moved to?”\)\. SS\-Assistant questions test recall of assistant responses \(e\.g\., “What recipe did you recommend last time?”\)\. SS\-Preference questions test user preference recall \(e\.g\., “Do you remember my dietary restrictions?”\)\. Multi\-Session questions require cross\-conversation reasoning, connecting information scattered across separate interactions\. Temporal Reasoning questions test time\-dependent queries where the answer depends on when events occurred relative to each other\. Knowledge Update questions test handling of changed information, where a user corrects or updates a previously stated fact\.
The key methodological difference from LoCoMo is that LongMemEval’s multi\-session design tests whether memory systems can maintain coherence across conversation boundaries—a challenge that exercises both ingestion \(correctly associating facts across sessions\) and retrieval \(bridging semantic gaps between separate conversations\) capabilities\.
### 6\.2Evaluation Methodology
Both benchmarks share a 5\-phase evaluation pipeline: INGEST loads conversation sessions into each memory system; INDEXING waits for providers to complete extraction and embedding; SEARCH queries memories for each evaluation question; ANSWER generates answers using retrieved memories; and EVALUATE scores answers using question type\-specific LLM judges\. We use specialized judge prompts for different question types \(see Appendix [A](https://arxiv.org/html/2604.19771#A1)\), including off\-by\-one tolerance for temporal questions, rubric\-based scoring for preference questions, and appropriate handling of knowledge updates\. LoCoMo uses GPT\-4 for answer generation; LongMemEval uses GPT\-4\.1 for answer generation to isolate memory architecture differences from LLM capability variation\.
### 6\.3Metrics
#### 6\.3\.1LoCoMo Metrics
We report three complementary metrics that capture different aspects of answer quality\. F1 Score measures token\-level precision and recall between generated and ground truth answers, providing a strict assessment of factual content overlap\. BLEU\-1 \(B1\) captures unigram lexical similarity, offering a softer measure of word\-level correspondence\. Finally, LLM Judge \(J\) uses GPT\-4 to evaluate answer correctness on a 0\-100 scale, capturing semantic equivalence beyond surface\-level lexical matching—particularly important when correct answers may be paraphrased or expressed differently than the reference\.
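For reference, the two lexical metrics can be computed as in the sketch below\. It assumes lowercase whitespace tokenization and a single reference answer, and it takes BLEU\-1 to be clipped unigram precision times the standard brevity penalty; the exact tokenization and BLEU implementation used in the evaluation are not specified here\.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization (an assumption for this sketch)."""
    return text.lower().split()


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = tokens(prediction), tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def bleu1(prediction, reference):
    """Clipped unigram precision scaled by the brevity penalty."""
    pred, ref = tokens(prediction), tokens(reference)
    if not pred:
        return 0.0
    clipped = sum((Counter(pred) & Counter(ref)).values())
    # Penalize predictions shorter than the reference; longer ones get 1.0.
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity * clipped / len(pred)


f1_example = token_f1("James likes tennis", "tennis")
bleu_example = bleu1("James likes tennis", "tennis")
```

In this example the prediction contains the reference plus two extra words, so recall is perfect while precision is one third; both metrics penalize the extra tokens even though the answer is correct, which is the gap the LLM Judge metric is meant to cover\.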
#### 6\.3\.2LongMemEval Metrics
We report accuracy as the primary metric, measuring end\-to\-end correctness\. Retrieval quality is independently assessed via LLM\-based chunk relevance evaluation, supplemented by Hit@K, Mean Reciprocal Rank \(MRR\), Normalized Discounted Cumulative Gain \(NDCG\), Precision@K, Recall@K, and F1@K\.
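The ranking metrics have standard definitions, sketched below for a single question with binary relevance labels \(an assumption; graded relevance would change the NDCG gains\)\. MRR is the mean of reciprocal\_rank over all questions\. The function names are ours, and k is assumed to be at least 1\.

```python
import math


def hit_at_k(relevant, ranked, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(relevant, ranked):
    """1 / position of the first relevant item, or 0.0 if none is retrieved."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def precision_recall_f1_at_k(relevant, ranked, k):
    """Precision@K, Recall@K and their harmonic mean F1@K."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def ndcg_at_k(relevant, ranked, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


relevant = {"m2", "m5"}
ranked = ["m1", "m2", "m3", "m4"]
metrics = precision_recall_f1_at_k(relevant, ranked, k=4)
```

Because precision divides by K while recall divides by the number of relevant memories, retrieving many candidates per question tends to produce high Recall@K alongside low Precision@K\.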
### 6\.4Baselines
#### 6\.4\.1LoCoMo Baselines
We compare against 11 systems spanning different approaches to LLM memory\. These include the LoCoMo \(Maharana et al\., [2024](https://arxiv.org/html/2604.19771#bib.bib21)\) benchmark’s baseline retrieval approach; document\-focused methods like ReadAgent \(Lee et al\., [2024](https://arxiv.org/html/2604.19771#bib.bib8)\) with its gist\-based memory and MemoryBank \(Zhong et al\., [2023](https://arxiv.org/html/2604.19771#bib.bib9)\) with memory consolidation; architecture\-driven approaches including MemGPT \(Packer et al\., [2023](https://arxiv.org/html/2604.19771#bib.bib10)\) \(OS\-inspired memory paging\), A\-Mem \(Xu et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib5)\) \(agentic memory with structured management\), and A\-Mem\* \(A\-Mem variant with LLM\-as\-a\-Judge evaluation\); framework integrations such as LangMem \(LangChain\-based\) and OpenAI’s memory API; dedicated memory services Zep \(Zep AI, [2024](https://arxiv.org/html/2604.19771#bib.bib2)\) \(long\-term memory with session management\) and Mem0 \(Chhikara et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib1)\) \(vector\-based with automatic extraction\); and Mem0g, an enhanced variant of Mem0 incorporating graph\-based knowledge representation for improved multi\-hop reasoning\.
#### 6\.4\.2LongMemEval Baselines
We compare against two competitive systems: SuperMemory \(SuperMemory AI, [2024](https://arxiv.org/html/2604.19771#bib.bib3)\), which focuses on knowledge graph integration for multi\-hop reasoning, and Zep/Graphiti \(Zep AI, [2024](https://arxiv.org/html/2604.19771#bib.bib2)\), which combines long\-term memory with graph\-based knowledge representation\. All systems use GPT\-4\.1 for answer generation\.
## 7Results on LoCoMo
### 7\.1Main Results
Table [4](https://arxiv.org/html/2604.19771#S7.T4) presents comprehensive results on the LoCoMo benchmark across all four question types\. We report F1 score \(F1\), BLEU\-1 \(B1\), and LLM\-as\-a\-Judge score \(J\) where available, with higher values indicating better performance \(↑\)\.
Table 4: Performance comparison on LoCoMo benchmark across question types\. Metrics: F1 score, BLEU\-1 \(B1\), LLM\-as\-a\-Judge \(J\)\. Best in bold, second\-best underlined\. \(↑\) = higher is better\.
Single\-hop questions\. Cognis achieves 48\.66 F1 \(\+25\.7% over Mem0\), demonstrating that context\-aware extraction prevents memory pollution and that OpenSearch’s native BM25 significantly improves entity\-specific recall\. The LLM Judge improvement \(\+7\.2%\) confirms semantic retrieval quality beyond lexical matching\.
Multi\-hop questions\. Cognis achieves 31\.51 F1 \(\+10\.0% over Mem0\)\. Our hybrid retrieval successfully surfaces related facts by combining semantic similarity with OpenSearch BM25 keyword matching\. The gain reflects improved cross\-fact retrieval through better term matching, though multi\-hop reasoning remains an inherently difficult open challenge\.
Open\-domain questions\. Cognis achieves 54\.77 F1 \(\+10\.5% over Zep\) and a striking 85\.85 LLM Judge \(\+12\.1% over Zep’s 76\.60\)\. This represents a dramatic improvement: Cognis now leads on both F1 and LLM Judge for open\-domain questions\. OpenSearch’s native BM25 with configurable text analysis is the key driver—broad entity queries benefit substantially from proper tokenization and term\-frequency weighting that MongoDB’s text indexing could not provide\.
Temporal questions\. Our strongest F1 gains: 62\.68 F1 \(\+21\.6% over Mem0g\) and 77\.26 LLM Judge \(\+32\.9%\)\. This validates Cognis’s temporal boosting mechanism, which adjusts scores based on proximity between query references and memory event times\. The BLEU\-1 score of 58\.95 \(\+45\.5% over Mem0’s 40\.51\) indicates substantially better lexical coverage in temporal answer generation\.
### 7\.2Error Analysis
To understand where Cognis still fails, we manually analyzed 50 incorrect temporal question responses—the category where we see the largest gains but also the most room for improvement\.
Error distribution: Of the analyzed failures, 34% stemmed from retrieval failures \(correct memory exists but was not retrieved in top\-K\), 28% from temporal boosting mismatch \(wrong memory boosted due to date proximity\), 22% from reranker ordering errors \(correct memory retrieved but ranked below incorrect alternatives\), and 16% from LLM reasoning errors \(correct memories retrieved and ranked, but answer generation failed\)\.
Illustrative failure case: Consider the query “What did James eat for breakfast last Tuesday?” The system retrieved “James enjoys oatmeal for breakfast” \(a general preference stored months earlier\) rather than “James had eggs and toast on Tuesday morning” \(the specific event from the target date\)\. The temporal boosting mechanism correctly identified the query as time\-sensitive, but it boosted the preference fact because that memory document happened to be created closer to the query date—even though the query sought a specific dated event\.
This reveals a limitation in Cognis’s current temporal boosting: it operates on memory storage timestamps rather than event timestamps\. When the event date is explicitly mentioned in the memory content \(e\.g\., “on Tuesday morning”\), the system should use that date for temporal boosting rather than the storage date\. Improving event date extraction during ingestion would prevent this class of errors\.
### 7\.3Ablation Studies
We conduct two sets of ablation experiments to quantify the contribution of individual architectural components: \(1\) the impact of embedding model choice on retrieval quality, and \(2\) the impact of retrieval pipeline variants including storage backend, reranker choice, and query decomposition strategy\.
#### 7\.3\.1Embedding Model Comparison
Table [5](https://arxiv.org/html/2604.19771#S7.T5) compares four embedding models within our retrieval pipeline, holding all other components constant\. Each model generates dual\-dimension \(768D \+ 256D\) Matryoshka embeddings for two\-stage retrieval\.
Table 5: Embedding model ablation on LoCoMo benchmark\. All configurations use the same retrieval pipeline \(RRF fusion \+ BGE\-2 reranking \+ OpenSearch BM25\)\. Best per\-column in bold\.
Key findings\. Embedding model choice has a pronounced and category\-dependent impact on retrieval quality\. Nomic Embed achieves the highest single\-hop F1 \(50\.23\) and temporal F1 \(60\.48\), excelling at direct fact recall and time\-sensitive queries\. Gemma Embed shows a striking F1\-Judge divergence: despite the lowest single\-hop F1 \(40\.94\), it achieves the highest Judge scores for both single\-hop \(71\.88\) and multi\-hop \(69\.23\), suggesting that its embeddings capture semantic correctness better than token\-level overlap\. Nomic V2 leads on multi\-hop \(F1=30\.31\) and open\-domain \(F1=51\.67\), demonstrating stronger cross\-fact retrieval\. Jina V3 delivers balanced performance without leading any single category\.
These results suggest that embedding model selection creates a fundamental tradeoff between token\-level precision \(F1\) and semantic correctness \(Judge\), and that task\-specific embedding selection—or ensemble strategies—may yield further improvements\.
#### 7\.3\.2Retrieval Pipeline Ablation
Table [6](https://arxiv.org/html/2604.19771#S7.T6) isolates the impact of three architectural choices: BM25 storage backend \(MongoDB vs\. OpenSearch\), cross\-encoder reranker \(BGE\-2 vs\. Zero Entropy\), and query preprocessing \(direct vs\. LLM decomposition\)\.
Table 6: Retrieval pipeline ablation on LoCoMo benchmark\. All configurations use the same embedding model\. Best per\-column in bold\.
OpenSearch vs\. MongoDB BM25\. The single largest architectural improvement comes from switching the BM25 backend from MongoDB to OpenSearch\. Comparing the BGE\-2 \(MongoDB\) row with BGE\-2 \+ OpenSearch, we observe: \+7\.3% single\-hop F1 \(45\.33 → 48\.66\), \+3\.1% multi\-hop F1 \(30\.55 → 31\.51\), and most dramatically, \+20\.3% open\-domain Judge \(71\.34 → 85\.85\)\. OpenSearch’s native BM25 implementation with configurable text analyzers provides substantially better term matching for broad entity queries, where proper tokenization and term\-frequency weighting are critical\.
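The percentages above are relative changes, not percentage\-point differences\. A short check of the arithmetic, using only the values quoted in this paragraph \(the helper name relative\_gain is ours\):

```python
def relative_gain(before: float, after: float) -> float:
    """Percentage change of 'after' relative to 'before'."""
    return 100.0 * (after - before) / before


comparisons = {
    "single-hop F1": (45.33, 48.66),
    "multi-hop F1": (30.55, 31.51),
    "open-domain Judge": (71.34, 85.85),
}
gains = {name: round(relative_gain(b, a), 1) for name, (b, a) in comparisons.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

The open\-domain Judge gain is 14\.51 points in absolute terms, which corresponds to the 20\.3% relative improvement reported\.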
Reranker comparison\. BGE\-2 outperforms the Zero Entropy reranker on multi\-hop F1 \(30\.55 vs\. 27\.50, Δ=\+11\.1%\) and single\-hop Judge \(70\.21 vs\. 68\.23\), while Zero Entropy shows slightly higher single\-hop F1 \(45\.95 vs\. 45\.33\) and temporal BLEU\-1 \(58\.76 vs\. 53\.43\)\. BGE\-2’s advantage on multi\-hop reasoning—where nuanced relevance discrimination between related facts is critical—motivated its selection as the default reranker\.
LLM query decomposition\. Decomposing queries into sub\-questions via an LLM before retrieval yields marginal gains on temporal Judge \(77\.15, the highest in this column\) but consistently decreases performance on simpler question types: −1\.3% single\-hop F1, −14\.2% open\-domain F1\. This suggests that query decomposition introduces retrieval noise for well\-formed queries where the original query already captures intent precisely\. The technique may be better suited as a selective strategy triggered by query complexity heuristics rather than a default pipeline stage\.
#### 7\.3\.3RRF Weight Ablation
To validate the 70/30 vector/BM25 weighting used throughout our evaluation, we conducted a systematic ablation varying the RRF weight distribution across seven configurations \(30/70 through 80/20\)\. All other pipeline parameters—embedding model, reranker \(BGE\-2\), temporal boosting, top\_k=50—were held constant\. Table [7](https://arxiv.org/html/2604.19771#S7.T7) presents the full results\.
Table 7: RRF weight ablation on LoCoMo benchmark\. Seven vector/BM25 weight configurations evaluated with all other pipeline components held constant\. Best per\-column in bold\.
70/30 leads on the hardest categories\. The two most challenging question categories—multi\-hop reasoning and temporal reasoning—are where RRF weight distribution has the most impact\. The 70/30 configuration achieves the highest multi\-hop F1 \(31\.51\), the highest temporal F1 \(62\.68\), and the highest temporal Judge score \(77\.26\)\. These categories require the system to connect facts across multiple memories and reason about time\-dependent information, precisely the scenarios where the balance between semantic linking \(vector\-dominant\) and lexical anchoring \(BM25 at 30%\) is critical\. The 30% BM25 weight provides sufficient keyword anchoring for entity names and temporal expressions while preserving the vector component’s cross\-memory semantic reasoning capacity\.
Overall variation is within noise, but per\-category breakdown reveals 70/30’s edge\. Across all seven configurations, single\-hop F1 ranges from 44\.03 to 49\.94 \(∼6 point spread\), while multi\-hop F1 ranges from 27\.18 to 31\.51 \(∼4 point spread\)\. The relatively flat overall performance confirms that the BGE\-2 reranker stabilizes end\-to\-end quality regardless of initial weight distribution\. However, the per\-category breakdown reveals that 70/30 uniquely excels on the categories with the widest performance variance: multi\-hop and temporal\. The nearest competitor, 75/25, trails 70/30 on multi\-hop F1 by 0\.69 points and on temporal F1 by 0\.57 points—small but consistent margins that compound across question types\.
Degradation at extremes\. BM25\-heavy configurations \(40/60, 30/70\) show pronounced degradation: single\-hop F1 drops by 8–10% \(from 48\.66 to 44\.03–44\.73\), and open\-domain Judge scores fall by 7–9 points \(from 85\.85 to 76\.69–78\.56\)\. This occurs because excessive BM25 weight overwhelms the semantic signal needed for paraphrase matching and broad entity queries\. Conversely, vector\-heavy configurations \(80/20\) lose multi\-hop F1 \(29\.87 vs\. 31\.51 for 70/30\) because reduced BM25 weight provides insufficient keyword grounding for entity\-specific fact retrieval across multiple memories\. The 70/30 configuration represents the sweet spot: enough BM25 to boost keyword\-specific temporal and entity queries, enough vector to preserve multi\-hop semantic reasoning\.
## 8Results on LongMemEval
We evaluate Cognis on LongMemEval across eight answer generation models \(Claude Opus 4\.6, Claude Sonnet 4\.6, Claude Haiku 4\.5, GPT\-5, GPT\-5\-mini, GPT\-4\.1, GPT\-4o, and Gemini 3 Flash\), using GPT\-4\.1 as the LLM judge for all evaluations\. We additionally compare against SuperMemory and Zep/Graphiti as external baselines\.
### 8\.1Detailed Results
Table [8](https://arxiv.org/html/2604.19771#S8.T8) presents Cognis’s performance across all six LongMemEval question types, with both accuracy \(end\-to\-end correctness\) and retrieval quality metrics \(GPT\-4\.1 judge\)\. Overall retrieval quality: P@K=0\.10, R@K=0\.83, F1@K=0\.17\.
Table 8: LongMemEval benchmark results by question type \(GPT\-4\.1 judge\)\. Accuracy measures end\-to\-end correctness; retrieval metrics assess whether relevant memories were retrieved\. Best accuracy in bold\.
SS\-User \(100\.0%\)\. Perfect accuracy on single\-session user facts validates the full pipeline from context\-aware extraction through hybrid retrieval to answer generation\. When a user explicitly states a fact, Cognis reliably extracts, stores, and retrieves it\. The high Hit@K \(0\.91\) confirms that the retrieval stage surfaces the correct evidence in nearly all cases\.
SS\-Preference \(93\.3%\)\. Strong preference recall demonstrates the memory taxonomy’s effectiveness in classifying and retrieving personalization\-relevant memories\. The remarkably high Hit@K \(0\.97\) indicates that relevant preferences are almost always retrieved, with the small accuracy gap attributable to answer synthesis ambiguity rather than retrieval failure—the system finds the right memories but occasionally generates responses that do not precisely match the expected answer format\.
Knowledge Update \(92\.3%\)\. This result directly validates Cognis’s version chain architecture\. The is\_current flags and replaces\_id links ensure that when information changes, only the latest version surfaces during retrieval\. The 92\.3% accuracy on knowledge updates demonstrates that context\-aware ingestion—where UPDATE operations replace outdated facts rather than creating contradictions—provides a structural advantage for handling evolving information that simpler memory systems lack\. The improved Hit@K \(0\.94\) reflects reliable retrieval of the correct version\.
SS\-Assistant \(87\.5%\)\. Assistant response recall reaches 87\.5%, with notably strong retrieval metrics: Hit@K=0\.96, MRR=0\.88, NDCG=0\.89—the highest retrieval quality across all question types\. This demonstrates that Cognis’s immediate recall index effectively surfaces assistant\-generated content\. As shown in Table [9](https://arxiv.org/html/2604.19771#S8.T9), Claude Opus 4\.6 further improves this category to 92\.9%\.
Multi\-Session \(86\.5%\)\. Cross\-session reasoning represents one of the hardest retrieval challenges, consistent with multi\-hop findings on LoCoMo\. The lower Hit@K \(0\.68\) pinpoints retrieval as the bottleneck: when facts span multiple conversations, the system must bridge semantic gaps across session boundaries\. RRF fusion helps—BM25 anchors on entity names that appear across sessions while vector search captures semantic relationships—but multi\-session reasoning remains an open challenge where explicit cross\-session linking could provide further gains\. Claude Opus 4\.6 pushes this to 87\.2% \(Table [10](https://arxiv.org/html/2604.19771#S8.T10)\)\.
Temporal Reasoning \(84\.2%\)\. Temporal performance reaches 84\.2%, while stronger answer models like Claude Opus 4\.6 achieve 92\.5% \(Table [10](https://arxiv.org/html/2604.19771#S8.T10)\)\. The Hit@K of 0\.80 suggests that time\-relevant memories are retrieved in most cases, with temporal boosting providing the critical ordering signal that surfaces the temporally correct memory\. The variance across answer models \(84\.2%–92\.5%\) indicates that temporal reasoning quality depends substantially on the answer generation model’s ability to synthesize time\-sensitive information from retrieved context\.
### 8\.2Comparative Results: Cross\-System
Table [9](https://arxiv.org/html/2604.19771#S8.T9) compares Cognis \(best per\-type across all answer models\) against SuperMemory and Zep/Graphiti, isolating the impact of memory architecture\. Best per\-row in bold\.
Table 9: Cross\-system accuracy \(%\) on LongMemEval\. Cognis column shows the best result across all eight answer models for each question type \(GPT\-4\.1 judge\)\. Best per\-row in bold\.
Cognis leads overall by \+10\.8pp over SuperMemory and \+21\.2pp over Zep/Graphiti, with architectural advantages mapping cleanly to specific question types:
Preference handling \(\+23\.3pp over SuperMemory\)\. The 13\-category memory taxonomy and targeted retrieval pipeline excel at capturing and recalling user preferences\. SuperMemory’s graph\-based approach achieves only 70\.0% on preference questions, while Cognis reaches 93\.3%, suggesting that explicit semantic categorization provides more reliable preference retrieval than graph traversal\.
Knowledge updates \(\+7\.7pp over SuperMemory\)\. Version chains with is\_current flags provide a structural advantage for handling evolving information\. While SuperMemory achieves a competitive 88\.5%, Cognis’s explicit version tracking reaches 96\.2% \(with Claude Opus 4\.6\), ensuring that outdated facts are reliably superseded\.
Temporal reasoning \(\+15\.8pp over SuperMemory\)\. Temporal boosting provides consistent gains across both LoCoMo and LongMemEval, confirming its generalizability\. Cognis achieves 92\.5% \(with Claude Opus 4\.6\) compared to SuperMemory’s 76\.7%\. The combination of temporal intent detection, time\-window scoring, and explicit event time metadata creates a robust temporal reasoning pipeline that neither SuperMemory nor Zep/Graphiti replicate\.
Multi\-session reasoning \(\+15\.8pp over SuperMemory\)\. Hybrid RRF retrieval’s combination of semantic and keyword matching bridges cross\-session gaps more effectively than single\-modality approaches\. Cognis achieves 87\.2% \(with Claude Opus 4\.6\) compared to SuperMemory’s 71\.4%\. BM25’s exact matching on entity names provides critical cross\-session anchoring that pure embedding\-based retrieval misses\.
SuperMemory leads SS\-Assistant\. SuperMemory’s 96\.4% on SS\-Assistant still leads Cognis’s best of 92\.9% \(Claude Opus 4\.6\) by 3\.5pp\. SuperMemory’s graph\-based approach captures assistant\-generated knowledge more effectively, though the gap is modest compared to Cognis’s large leads on other categories\.
### 8\.3Comparative Results: Across Answer Models
Table [10](https://arxiv.org/html/2604.19771#S8.T10) presents Cognis’s accuracy across eight answer generation models, all evaluated with GPT\-4\.1 as the LLM judge\. Best per\-column in bold\.
Table 10: Cognis accuracy \(%\) on LongMemEval across answer generation models \(GPT\-4\.1 judge\)\. Best per\-column in bold\.
Consistent performance across all models\. Overall accuracy ranges from 83\.4% \(Gemini 3 Flash\) to 92\.4% \(Claude Opus 4\.6\), a 9\.0pp spread\. Critically, every configuration exceeds both SuperMemory \(81\.6%\) and Zep/Graphiti \(71\.2%\), demonstrating that Cognis’s architectural advantages—context\-aware extraction, hybrid retrieval, version chains, and temporal boosting—are robust to the choice of answer generation model\.
Best overall: Claude Opus 4\.6 \(92\.4%\)\. Claude Opus 4\.6 achieves the highest overall accuracy, leading on SS\-Assistant \(92\.9%\), knowledge updates \(96\.2%\), temporal reasoning \(92\.5%\), and multi\-session recall \(87\.2%\)\. Its strength on the hardest categories suggests that more capable answer generation models better leverage Cognis’s retrieved context for complex reasoning tasks\.
Model\-specific strengths\. Different answer models excel on different question types: GPT\-4\.1 and Claude Sonnet 4\.6 achieve perfect 100\.0% on SS\-User; Claude Opus 4\.6, GPT\-4\.1, and GPT\-4o share the lead on SS\-Preference at 93\.3%; GPT\-5 achieves 91\.7% temporal reasoning as the second\-strongest model on that category\. This variance indicates that question type performance depends on the interaction between retrieval quality and answer generation capability\.
Temporal reasoning scales with model capability\. Temporal accuracy ranges from 82\.7% \(GPT\-4\.1\) to 92\.5% \(Claude Opus 4\.6\), a 9\.8pp spread\. This suggests that temporal reasoning quality depends substantially on the answer generation model’s ability to synthesize time\-sensitive information from retrieved context, even when the retrieval pipeline provides the correct temporal evidence\.
### 8\.4Cross\-Benchmark Consistency
The architectural advantages that drive performance on LoCoMo—version chains for knowledge consistency, temporal boosting for time\-aware queries, hybrid retrieval for broad coverage—consistently translate to LongMemEval despite the benchmarks’ different evaluation methodologies and question distributions\. Cognis achieves up to 96\.2% on knowledge updates \(validating version chains\), 92\.5% on temporal reasoning \(validating temporal boosting\), and 87\.2% on multi\-session recall \(validating hybrid retrieval\)\. This cross\-benchmark consistency, sustained across eight different answer generation models, strengthens the validity of our architectural claims and suggests that these mechanisms address fundamental challenges in long\-term memory rather than exploiting benchmark\-specific patterns\.
## 9Discussion
### 9\.1Key Findings
Temporal reasoning benefits most from Cognis’s pipeline: Our strongest improvements appear on temporal questions \(\+32\.9% LLM Judge score over Mem0g, \+21\.6% F1\)\. This validates the design of Cognis’s temporal boosting mechanism and suggests that explicit time\-awareness is underexplored in existing memory systems\. While Zep\(Zep AI,[2024](https://arxiv.org/html/2604.19771#bib.bib2)\)provides some temporal awareness through session management, and Mem0\(Chhikaraet al\.,[2025](https://arxiv.org/html/2604.19771#bib.bib1)\)stores timestamps, neither implements explicit temporal boosting during retrieval\. The combination of temporal intent detection, time\-window scoring, and BGE\-2 cross\-encoder reranking\(Chenet al\.,[2024](https://arxiv.org/html/2604.19771#bib.bib17)\)creates a powerful pipeline for time\-sensitive queries\.
OpenSearch BM25 is the key architectural enabler: Our ablation studies reveal that switching the BM25 backend from MongoDB to OpenSearch produces the single largest performance gain across all architectural changes tested\. The \+20\.3% improvement on open\-domain Judge scores \(71\.34→\\rightarrow85\.85\) demonstrates that native BM25 with configurable text analysis—including proper tokenization, stemming, and term\-frequency weighting—is critical for broad entity queries where MongoDB’s simpler text indexing falls short\.
Hybrid retrieval outperforms single modality: The combination of vector and BM25\(Robertson and Zaragoza,[2009](https://arxiv.org/html/2604.19771#bib.bib22)\)retrieval through RRF fusion\(Cormacket al\.,[2009](https://arxiv.org/html/2604.19771#bib.bib13)\)consistently outperforms any single approach, consistent with findings in hybrid search literature\(Maet al\.,[2021](https://arxiv.org/html/2604.19771#bib.bib14)\)\. BM25 is particularly important for queries containing specific names, dates, or technical terms that semantic search might miss\. In our experiments, removing BM25 leads to noticeable performance degradation, confirming its complementary value\. This contrasts with systems like MemoryBank\(Zhonget al\.,[2023](https://arxiv.org/html/2604.19771#bib.bib9)\)and ReadAgent\(Leeet al\.,[2024](https://arxiv.org/html/2604.19771#bib.bib8)\)that rely primarily on embedding\-based retrieval\.
Context\-aware extraction prevents memory pollution: By retrieving similar existing memories before LLM extraction, Cognis’s ingestion pipeline makes intelligent decisions about ADD/UPDATE/DELETE/NONE operations\. This approach shares conceptual similarities with MemR3’s \(Du et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib6)\) reflective reasoning, but applies it at ingestion time rather than retrieval time\. The result is a cleaner memory store that prevents duplicates and maintains consistency as information evolves\.
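The sketch below shows one way the four operations could be applied to a store that keeps version history\. The field names `is_current` and `replaces_id` follow the paper; everything else \(the `Memory` and `MemoryStore` classes, integer IDs, and the choice to treat DELETE as retiring a record rather than erasing it\) is an assumption made for illustration\. The LLM call that chooses the operation is left out, so the operation is passed in directly\.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional

VALID_OPS = {"ADD", "UPDATE", "DELETE", "NONE"}


@dataclass
class Memory:
    id: int
    text: str
    is_current: bool = True
    replaces_id: Optional[int] = None  # link to the version this one supersedes


class MemoryStore:
    """Hypothetical in-memory store with version chains."""

    def __init__(self) -> None:
        self._next_id = count(1)
        self.records: Dict[int, Memory] = {}

    def apply(self, op: str, text: Optional[str] = None,
              target_id: Optional[int] = None) -> Optional[Memory]:
        if op not in VALID_OPS:
            raise ValueError(f"unknown operation: {op}")
        if op == "NONE":
            return None  # fact already stored; nothing to write
        if op in ("UPDATE", "DELETE"):
            # Retire the old version instead of erasing it (assumption).
            self.records[target_id].is_current = False
            if op == "DELETE":
                return None
        new = Memory(next(self._next_id), text,
                     replaces_id=target_id if op == "UPDATE" else None)
        self.records[new.id] = new
        return new

    def current(self) -> List[Memory]:
        return [m for m in self.records.values() if m.is_current]

    def history(self, memory_id: int) -> List[str]:
        """Walk replaces_id links from newest to oldest version."""
        chain, cursor = [], memory_id
        while cursor is not None:
            mem = self.records[cursor]
            chain.append(mem.text)
            cursor = mem.replaces_id
        return chain


store = MemoryStore()
old = store.apply("ADD", "Lives in Pune")
new = store.apply("UPDATE", "Lives in Mumbai", target_id=old.id)
print(store.history(new.id))
```

Retrieval would then filter on `is_current`, while the `replaces_id` chain keeps earlier versions available for questions about how a fact changed\.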
Embedding model choice creates category\-level tradeoffs: Our ablation studies show that no single embedding model dominates across all question types\. Nomic Embed achieves the highest single\-hop F1 \(50\.23\) while Gemma Embed leads on Judge scores despite lower F1, suggesting a fundamental tradeoff between token\-level precision and semantic correctness\. This finding points toward ensemble or adaptive embedding strategies as a promising direction\.
Cross\-encoder reranking provides significant quality gains: Following insights from CADET \(Tamber et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib15)\) on cross\-encoder effectiveness, Cognis’s BGE\-2 reranker \(Chen et al\., [2024](https://arxiv.org/html/2604.19771#bib.bib17)\) provides substantial improvements on multi\-hop reasoning \(\+11\.1% F1 over the Zero Entropy alternative\) despite adding only 20\-50ms latency\. This suggests that initial bi\-encoder retrieval \(even with hybrid search\) benefits from cross\-encoder refinement for nuanced relevance judgments\.
Matryoshka embeddings provide efficiency without accuracy loss: Leveraging Matryoshka Representation Learning \(Kusupati et al\., [2022](https://arxiv.org/html/2604.19771#bib.bib16)\), Cognis’s two\-stage retrieval with 256D shortlisting followed by 768D re\-ranking reduces latency by approximately 50% while maintaining accuracy within 1\.4% of single\-stage search\. This enables scaling to larger memory stores without proportional latency increases\.
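The two\-stage idea can be sketched in a few lines: a Matryoshka embedding's leading dimensions form a usable lower\-dimensional embedding, so a cheap pass over the 256D prefix builds a shortlist and the full 768D vector is only compared against that shortlist\. The dimensions follow the paper; the function names, the shortlist size, and the brute\-force scan \(a real deployment would use a vector index\) are assumptions for this example\.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_search(query, vectors, short_dim=256, shortlist=50, top_k=10):
    """Shortlist on the truncated prefix, then re-rank with full vectors.

    query:   full-dimension query embedding.
    vectors: dict mapping memory ID -> full-dimension embedding.
    """
    q_short = query[:short_dim]
    # Stage 1: cheap comparison on the first short_dim dimensions only.
    coarse = sorted(
        vectors,
        key=lambda i: cosine(q_short, vectors[i][:short_dim]),
        reverse=True,
    )[:shortlist]
    # Stage 2: exact comparison, restricted to the shortlist.
    return sorted(coarse, key=lambda i: cosine(query, vectors[i]), reverse=True)[:top_k]
```

The first stage does roughly a third of the arithmetic per candidate \(256 of 768 dimensions\), and the second stage touches only `shortlist` vectors, which is the source of the latency saving\. The accuracy cost comes from relevant memories that fall outside the shortlist\.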
Comparison with operating system approaches: Unlike MemGPT’s \(Packer et al\., [2023](https://arxiv.org/html/2604.19771#bib.bib10)\) complex memory paging between “main memory” \(context\) and “disk” \(external storage\), Cognis uses a simpler dual\-store design where both stores are always accessible\. This reduces engineering complexity while achieving strong performance, suggesting that explicit OS\-style memory management may be unnecessary when retrieval quality is sufficiently high\.
Cross\-benchmark generalization validates architectural claims: The fact that version chains drive knowledge update accuracy \(up to 96\.2% on LongMemEval \(Wu et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib20)\)\), temporal boosting drives temporal reasoning \(up to 92\.5%\), and hybrid retrieval drives multi\-session recall \(up to 87\.2%\) across two independent benchmarks with different evaluation methodologies provides strong evidence that these are genuine architectural advantages, not dataset\-specific artifacts\. LoCoMo tests single\-conversation recall across 50\+ sessions, while LongMemEval tests multi\-session interactive memory across 500 questions with 6 distinct question types\. The consistency of results across these complementary evaluation frameworks, sustained across eight different answer generation models, strengthens the validity of our architectural claims\.
RRF weight distribution is robust but category\-sensitive: Our ablation across 7 weight configurations \(30/70 through 80/20\) shows overall F1 varies by only ∼2%, but per\-category analysis reveals that 70/30 uniquely excels on the hardest categories \(multi\-hop F1=31\.51, temporal F1=62\.68\)\. This finding has practical implications: production deployments can use 70/30 as a reliable default without dataset\-specific tuning, and the BGE\-2 reranker stabilizes end\-to\-end quality regardless of the initial weight distribution, reducing sensitivity to this hyperparameter\.
### 9\.2 Latency
Table [11](https://arxiv.org/html/2604.19771#S9.T11) reports end\-to\-end retrieval latency measured across 500 LongMemEval queries\. Cognis achieves a p50 of 250ms and a mean of 390ms, with p99 under 1 second\. This confirms that the full hybrid pipeline \(Matryoshka two\-stage vector search, OpenSearch BM25, RRF fusion, temporal boosting, and BGE\-2 cross\-encoder reranking\) remains practical for interactive applications despite its multi\-stage design\.
Table 11: Cognis end\-to\-end retrieval latency \(500 LongMemEval queries\)\.
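For readers reproducing these statistics, the sketch below computes p50, p99, and mean from a list of per\-query latencies using the nearest\-rank percentile definition\. The paper does not state which percentile method was used, so the definition, the function names, and the sample values \(which are made up and are not the paper's measurements\) are all assumptions\.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "mean": statistics.fmean(latencies_ms),
    }


# Illustrative values only: four fast queries and one slow outlier.
print(summarize([200, 220, 250, 260, 900]))
# -> {'p50': 250, 'p99': 900, 'mean': 366.0}
```

A mean well above the median, as in the reported 390ms versus 250ms, is what a right\-skewed distribution produces: a small number of slow queries raise the mean without moving p50\.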
### 9\.3 Limitations
Temporal boosting scope: Temporal boosting is applied based on query analysis, not memory content analysis\. A fact stored on May 8th may incorrectly receive temporal boost for queries about “last week” based on storage date rather than content relevance\. Future work could refine temporal boosting to better distinguish time\-sensitive content\.
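To make this limitation concrete, the sketch below shows a boost that is driven purely by the query's temporal phrase and the memory's storage timestamp\. It is a hypothetical simplification: the phrase table, the look\-back windows, the multiplicative boost factor of 1\.2, and the function names are assumptions for illustration and do not describe the actual Cognis scoring function\.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from temporal phrases to look-back windows.
TEMPORAL_PHRASES = {
    "yesterday": timedelta(days=1),
    "last week": timedelta(days=7),
    "last month": timedelta(days=30),
}


def detect_window(query, now):
    """Return (start, end) if the query contains a known temporal phrase, else None."""
    lowered = query.lower()
    for phrase, span in TEMPORAL_PHRASES.items():
        if phrase in lowered:
            return now - span, now
    return None


def temporal_boost(score, stored_at, window, boost=1.2):
    """Multiply the score when the memory's storage time falls inside the window."""
    if window is None:
        return score
    start, end = window
    return score * boost if start <= stored_at <= end else score


now = datetime(2023, 5, 12)
window = detect_window("What did I do last week?", now)
# A memory stored on May 8th is boosted whatever its content says.
print(temporal_boost(1.0, datetime(2023, 5, 8), window))
```

Nothing in `temporal_boost` inspects the memory text, so a memory stored on May 8th that describes an event from years earlier is boosted just like one describing that week\. Closing this gap would require extracting event dates from content at ingestion time\.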
Reranker latency: The BGE\-2 reranker adds 20\-50ms latency, which may be unacceptable for extremely latency\-sensitive applications\. An adaptive approach that selectively applies reranking based on query complexity could help\.
Embedding model tradeoff: Our ablation studies reveal that no single embedding model dominates all question types, creating a tension between token\-level precision and semantic correctness\. Adaptive or ensemble embedding strategies remain unexplored\.
Query decomposition overhead: LLM\-based query decomposition introduces noise for simple queries while potentially helping complex multi\-hop questions\. A selective activation mechanism based on query complexity heuristics would be more effective than blanket application\.
Assistant response recall: LongMemEval evaluation reveals that Cognis’s SS\-Assistant accuracy varies across answer models, from 71\.4% \(GPT\-4o\) to 92\.9% \(Claude Opus 4\.6\)\. This reflects a design tradeoff: the ingestion pipeline prioritizes extracting user\-stated facts, meaning assistant responses are stored as raw messages in the immediate recall index but are not prominently extracted as structured memories\. SuperMemory’s graph\-based approach still leads \(96\.4%\), but the gap has narrowed to just 3\.5pp with Claude Opus 4\.6, suggesting that stronger answer generation models partially compensate for the extraction\-focused design\.
### 9\.4 Future Work
Several promising directions emerge from our work:
- Reflective memory management: Incorporate ideas from MemR3 \(Du et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib6)\) and Hindsight Memory \(Latimer et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib7)\) to enable agents to reason about their own memory contents, identifying gaps and contradictions proactively
- Adaptive reranking: Following System 2 Attention \(Weston and Sukhbaatar, [2023](https://arxiv.org/html/2604.19771#bib.bib18)\) principles, selectively apply BGE\-2 reranking based on query complexity to optimize latency/quality tradeoffs
- Gist\-based compression: Apply ReadAgent’s \(Lee et al\., [2024](https://arxiv.org/html/2604.19771#bib.bib8)\) gist memory concept to create hierarchical summaries of memory contents for efficient broad\-context queries
- Graph integration: Optionally enable knowledge graph storage inspired by SuperMemory \(SuperMemory AI, [2024](https://arxiv.org/html/2604.19771#bib.bib3)\) for complex multi\-hop reasoning scenarios requiring explicit relationship traversal
## 10 Conclusion
We presented Lyzr Cognis, a memory architecture for conversational AI agents that addresses the fundamental limitation of LLM context windows by providing persistent, searchable memory across sessions\. Building on insights from cognitive science \(Tulving, [1972](https://arxiv.org/html/2604.19771#bib.bib23); Atkinson and Shiffrin, [1968](https://arxiv.org/html/2604.19771#bib.bib24)\) and recent advances in agent memory \(Chhikara et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib1); Zep AI, [2024](https://arxiv.org/html/2604.19771#bib.bib2); Packer et al\., [2023](https://arxiv.org/html/2604.19771#bib.bib10); Zhong et al\., [2023](https://arxiv.org/html/2604.19771#bib.bib9)\), Cognis combines principled memory organization with state\-of\-the\-art retrieval techniques\.
Our key contributions include:
1. A comprehensive memory taxonomy with 15 semantic categories and 2 persistence scopes \(USER for cross\-session, CONTEXT for session\-specific\) for organizing conversational knowledge
2. A streamlined dual\-store architecture combining OpenSearch \(for documents and native BM25 \(Robertson and Zaragoza, [2009](https://arxiv.org/html/2604.19771#bib.bib22)\) search\) with a vector database \(for Matryoshka \(Kusupati et al\., [2022](https://arxiv.org/html/2604.19771#bib.bib16)\) embeddings at 768D and 256D\)
3. A context\-aware ingestion pipeline that retrieves similar existing memories before LLM extraction, enabling intelligent ADD/UPDATE/DELETE/NONE decisions with full version tracking via is\_current flags and replaces\_id links, addressing a key limitation in existing systems where memory stores become polluted with duplicates
4. A hybrid retrieval pipeline using RRF fusion \(Cormack et al\., [2009](https://arxiv.org/html/2604.19771#bib.bib13)\) \(70% vector \+ 30% BM25\), explicit temporal boosting for time\-sensitive queries, and a BGE\-2 \(Chen et al\., [2024](https://arxiv.org/html/2604.19771#bib.bib17)\) cross\-encoder reranker for final result refinement
Evaluated on the LoCoMo benchmark \(Maharana et al\., [2024](https://arxiv.org/html/2604.19771#bib.bib21)\) across four question types, Cognis achieves state\-of\-the\-art results compared to 11 baseline systems including Mem0 \(Chhikara et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib1)\), Zep \(Zep AI, [2024](https://arxiv.org/html/2604.19771#bib.bib2)\), MemGPT \(Packer et al\., [2023](https://arxiv.org/html/2604.19771#bib.bib10)\), MemoryBank \(Zhong et al\., [2023](https://arxiv.org/html/2604.19771#bib.bib9)\), ReadAgent \(Lee et al\., [2024](https://arxiv.org/html/2604.19771#bib.bib8)\), A\-Mem \(Xu et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib5)\), and Mem0g: 48\.66 F1 on single\-hop questions \(\+25\.7% over Mem0\), 31\.51 F1 on multi\-hop \(\+10\.0%\), 54\.77 F1 on open\-domain \(\+10\.5% over Zep\), and 62\.68 F1 on temporal questions \(\+21\.6% over Mem0g\)\. Our strongest gains appear on temporal reasoning, with a 77\.26 LLM Judge score \(\+32\.9% over Mem0g\), and on open\-domain semantic correctness, with an 85\.85 LLM Judge score \(\+12\.1% over Zep\), validating the effectiveness of OpenSearch BM25 integration, temporal boosting, and BGE\-2 cross\-encoder reranking\. Ablation studies further demonstrate that the choice of BM25 backend \(OpenSearch vs\. MongoDB\) is the single most impactful architectural decision, and that embedding model selection creates significant category\-level performance tradeoffs\.
Cross\-benchmark validation on LongMemEval \(Wu et al\., [2025](https://arxiv.org/html/2604.19771#bib.bib20)\) confirms these results generalize: Cognis achieves up to 92\.4% overall accuracy \(with Claude Opus 4\.6\), consistently outperforming SuperMemory \(81\.6%\) and Zep/Graphiti \(71\.2%\) across all eight answer generation models tested, with particular strength on knowledge updates \(up to 96\.2%, validating version chains\), temporal reasoning \(up to 92\.5%, validating temporal boosting\), and multi\-session recall \(up to 87\.2%, validating hybrid retrieval\)\. An RRF weight ablation across seven configurations confirms that the 70/30 vector/BM25 weighting provides the optimal balance between semantic coverage and keyword precision, achieving the best performance on the hardest question categories \(multi\-hop and temporal\) while the BGE\-2 reranker stabilizes overall quality across all weight distributions\.
The system is open\-source and deployed in production serving conversational AI applications\. We believe that explicit temporal awareness, context\-aware memory management, native BM25 search infrastructure, and hybrid retrieval combining multiple modalities represent important directions for future memory systems research\. As LLM agents become more capable, their memory systems must evolve to support the kind of long\-term, coherent interactions that humans naturally expect from intelligent assistants\.
## References
- R\. C\. Atkinson and R\. M\. Shiffrin \(1968\) Human memory: a proposed system and its control processes\. Psychology of Learning and Motivation 2, pp\. 89–195\.
- J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu \(2024\) BGE m3\-embedding: multi\-lingual, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\. arXiv preprint arXiv:2402\.03216\. Note: BGE\-2 Reranker available at [https://huggingface\.co/BAAI/bge\-reranker\-v2\-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)\. External Links: [Link](https://arxiv.org/abs/2402.03216)\.
- P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\) Mem0: building production\-ready ai agents with scalable long\-term memory\. arXiv preprint arXiv:2504\.19413\. External Links: [Link](https://arxiv.org/abs/2504.19413)\.
- G\. V\. Cormack, C\. L\. Clarke, and S\. Buettcher \(2009\) Reciprocal rank fusion outperforms condorcet and individual rank learning methods\. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp\. 758–759\.
- Y\. Deng, W\. Zhang, Z\. Chen, and Q\. Gu \(2023\) Rephrase and respond: let large language models ask better questions for themselves\. arXiv preprint arXiv:2311\.04205\. External Links: [Link](https://arxiv.org/abs/2311.04205)\.
- X\. Du, L\. Li, D\. Zhang, and L\. Song \(2025\) MemR3: memory retrieval via reflective reasoning for llm agents\. arXiv preprint arXiv:2512\.20237\. Note: Code available at [https://github\.com/Leagein/memr3](https://github.com/Leagein/memr3)\. External Links: [Link](https://arxiv.org/abs/2512.20237)\.
- J\. He, R\. H\. Bai, S\. Williamson, J\. Z\. Pan, N\. Jaitly, and Y\. Zhang \(2025\) CLaRa: bridging retrieval and generation with continuous latent reasoning\. arXiv preprint arXiv:2511\.18659\. Note: Code available at [https://github\.com/apple/ml\-clara](https://github.com/apple/ml-clara)\. External Links: [Link](https://arxiv.org/abs/2511.18659)\.
- A\. Kusupati, G\. Bhatt, A\. Rege, M\. Wallingford, A\. Sinha, V\. Ramanujan, W\. Howard\-Snyder, K\. Chen, S\. Kakade, P\. Jain, and A\. Farhadi \(2022\) Matryoshka representation learning\. In Advances in Neural Information Processing Systems\.
- C\. Latimer, N\. Boschi, A\. Neeser, C\. Bartholomew, G\. Srivastava, X\. Wang, and N\. Ramakrishnan \(2025\) Hindsight is 20/20: building agent memory that retains, recalls, and reflects\. arXiv preprint arXiv:2512\.12818\. External Links: [Link](https://arxiv.org/abs/2512.12818)\.
- K\. Lee, X\. Chen, H\. Sohn, N\. Nishida, D\. Hu, and H\. D\. Chang \(2024\) ReadAgent: a human\-inspired reading agent with gist memory of very long contexts\. arXiv preprint arXiv:2402\.09727\. External Links: [Link](https://arxiv.org/abs/2402.09727)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, et al\. \(2020\) Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. In Advances in Neural Information Processing Systems, Vol\. 33, pp\. 9459–9474\.
- J\. Liu, Y\. Su, P\. Xia, S\. Han, Z\. Zheng, C\. Xie, M\. Ding, and H\. Yao \(2025\) SimpleMem: efficient lifelong memory for llm agents\. arXiv preprint arXiv:2601\.02553\. Note: Code available at [https://github\.com/aiming\-lab/SimpleMem](https://github.com/aiming-lab/SimpleMem)\. External Links: [Link](https://arxiv.org/abs/2601.02553)\.
- X\. Ma, K\. Sun, R\. Pradeep, and J\. Lin \(2021\) A replication study of dense passage retriever\. arXiv preprint arXiv:2104\.05740\. External Links: [Link](https://arxiv.org/abs/2104.05740)\.
- A\. Maharana, D\. Lee, S\. Tuber, M\. Jain, F\. Barbieri, and M\. Bansal \(2024\) Evaluating very long\-term conversational memory of llm agents\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\)\.
- C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\) MemGPT: towards llms as operating systems\. arXiv preprint arXiv:2310\.08560\. External Links: [Link](https://arxiv.org/abs/2310.08560)\.
- S\. Robertson and H\. Zaragoza \(2009\) The probabilistic relevance framework: bm25 and beyond\. Foundations and Trends in Information Retrieval 3 \(4\), pp\. 333–389\.
- SuperMemory AI \(2024\) SuperMemory: a memory system for llm agents\. Note: Knowledge graph\-based memory system for multi\-hop reasoning\. External Links: [Link](https://supermemory.ai/)\.
- M\. S\. Tamber, S\. Kazi, V\. Sourabh, and J\. Lin \(2025\) Conventional contrastive learning often falls short: improving dense retrieval with cross\-encoder listwise distillation and synthetic data\. arXiv preprint arXiv:2505\.19274\. External Links: [Link](https://arxiv.org/abs/2505.19274)\.
- E\. Tulving \(1972\) Episodic and semantic memory\. In Organization of Memory, E\. Tulving and W\. Donaldson \(Eds\.\), pp\. 381–403\.
- J\. Weston and S\. Sukhbaatar \(2023\) System 2 attention \(is something you might need too\)\. arXiv preprint arXiv:2311\.11829\. External Links: [Link](https://arxiv.org/abs/2311.11829)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025\) LongMemEval: benchmarking chat assistants on long\-term interactive memory\. In International Conference on Learning Representations \(ICLR\)\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\) A\-mem: agentic memory for llm agents\. arXiv preprint arXiv:2502\.12110\. External Links: [Link](https://arxiv.org/abs/2502.12110)\.
- Zep AI \(2024\) Zep: long\-term memory for ai assistants\. Note: Open\-source long\-term memory service for AI assistants\. External Links: [Link](https://www.getzep.com/)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2023\) MemoryBank: enhancing large language models with long\-term memory\. arXiv preprint arXiv:2305\.10250\. External Links: [Link](https://arxiv.org/abs/2305.10250)\.
## Appendix A Evaluation Prompts
This appendix contains the actual prompts used in our LoCoMo benchmark evaluation\. These prompts are from our open\-source evaluation code\.
### A\.1 LLM Judge Prompt
We use a generous grading approach adapted from Mem0’s evaluation methodology\. The judge grades answers as CORRECT if they touch on the same topic as the gold answer, with flexibility for time format variations:
Your task is to label an answer to a question as ’CORRECT’
or ’WRONG’\. You will be given the following data:
\(1\) a question \(posed by one user to another user\),
\(2\) a ’gold’ \(ground truth\) answer,
\(3\) a generated answer
which you will score as CORRECT/WRONG\.
The point of the question is to ask about something one user
should know about the other user based on their prior
conversations\. The gold answer will usually be a concise
answer that includes the referenced topic, for example:
Question: Do you remember what I got the last time I went
to Hawaii?
Gold answer: A shell necklace
The generated answer might be much longer, but you should be
generous with your grading \- as long as it touches on the
same topic as the gold answer, it should be counted as
CORRECT\.
For time related questions, the gold answer will be a
specific date, month, year, etc\. The generated answer might
be much longer or use relative time references \(like "last
Tuesday" or "next month"\), but you should be generous with
your grading \- as long as it refers to the same date or time
period as the gold answer, it should be counted as CORRECT\.
Even if the format differs \(e\.g\., "May 7th" vs "7 May"\),
consider it CORRECT if it’s the same date\.
Now it’s time for the real question:
Question: \{question\}
Gold answer: \{gold\_answer\}
Generated answer: \{generated\_answer\}
First, provide a short \(one sentence\) explanation of your
reasoning, then finish with CORRECT or WRONG\.
Do NOT include both CORRECT and WRONG in your response,
or it will break the evaluation script\.
Just return the label CORRECT or WRONG in a json format
with the key as "label"\.
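The prompt's last two instructions imply that a script reads the verdict from the judge's reply\. The sketch below shows one way such a parser might look; it is not taken from the released evaluation code, and the function name `parse_judge_label` and the plain\-text fallback are assumptions\. It prefers the JSON `"label"` field and returns `None` when no unambiguous verdict can be found\.

```python
import json
import re

LABELS = ("CORRECT", "WRONG")


def parse_judge_label(response):
    """Return 'CORRECT' or 'WRONG', or None if the reply has no unambiguous label."""
    # Preferred path: a JSON object such as {"label": "CORRECT"}.
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match:
        try:
            label = str(json.loads(match.group(0)).get("label", "")).strip().upper()
        except json.JSONDecodeError:
            label = ""
        if label in LABELS:
            return label
    # Fallback: exactly one of the two words appears as a whole word.
    found = {w for w in LABELS if re.search(rf"\b{w}\b", response)}
    return found.pop() if len(found) == 1 else None


print(parse_judge_label('Both answers name the same date. {"label": "CORRECT"}'))
```

The fallback shows why the prompt forbids mentioning both words: a reply containing both CORRECT and WRONG outside the JSON object cannot be resolved by simple matching, so this sketch returns `None` rather than guessing\.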
### A\.2 Answer Generation Prompt
The general prompt used to generate answers from retrieved memories of two conversation speakers:
You are an intelligent memory assistant retrieving
information from conversation memories\.
CONTEXT:
You have access to memories from two speakers in a
conversation\. These memories contain timestamped information
that may be relevant to answering the question\.
INSTRUCTIONS:
1\. Carefully analyze all provided memories from both speakers
2\. Pay special attention to the timestamps to determine
the answer
3\. If the question asks about a specific event or fact,
look for direct evidence
4\. If the memories contain contradictory information,
prioritize the most recent memory
5\. If there is a question about time references \(like
"last year", "two months ago"\), calculate the actual
date based on the memory timestamp
6\. Always convert relative time references to specific
dates, months, or years
7\. Focus only on the content of the memories from both
speakers
8\. Be concise but COMPLETE\. For lists, include ALL items\.
APPROACH \(Think step by step\):
1\. First, examine all memories that contain information
related to the question
2\. Examine the timestamps and content of these memories
carefully
3\. Look for explicit mentions of dates, times, locations,
or events that answer the question
4\. If the answer requires calculation \(e\.g\., converting
relative time references\), show your work
5\. Formulate a precise, concise answer based solely on
the evidence in the memories
6\. Double\-check that your answer directly addresses the
question asked
7\. Ensure your final answer is specific and avoids vague
time references
Memories for user \{\{speaker\_1\_user\_id\}\}:
\{\{speaker\_1\_memories\}\}
Memories for user \{\{speaker\_2\_user\_id\}\}:
\{\{speaker\_2\_memories\}\}
Question: \{\{question\}\}
Answer:
### A\.3 Single\-Hop Question Prompt \(Category 1\)
For questions requiring a specific fact from memories:
This is a SINGLE\-HOP question requiring a specific fact
from memories\.
FOCUS ON THE TOP 1\-3 MOST RELEVANT MEMORIES\. Ignore
lower\-scored ones\.
RULES:
1\. Find the memory that directly answers the question
2\. Use EXACT words/phrases from that memory \(e\.g\.,
"Transgender woman" not "Trans"\)
3\. For lists \(hobbies, activities, pets\): include ALL
items from the relevant memory
4\. Be COMPLETE but CONCISE \- give the full answer, no
extra explanation
5\. IGNORE memories about different events/topics
Question: \{question\}
Complete answer from the most relevant memory:
### A\.4 Temporal Question Prompt \(Category 2\)
For questions asking WHEN something happened:
This is a TEMPORAL question asking WHEN something happened\.
FOCUS ON THE SINGLE MEMORY that mentions the EXACT event
in the question\. Ignore memories about similar but
DIFFERENT events\.
FORMAT RULES:
1\. "how long ago" \-\> relative terms \(e\.g\., "10 years ago"\)
2\. "when" \-\> specific date from memory
3\. Use exact phrasing like "The week before X" if memory
says that
Question: \{question\}
Answer \(date/time from the most relevant memory\):
### A\.5 Multi\-Hop Question Prompt \(Category 3\)
For questions requiring careful inference from multiple facts:
This is a MULTI\-HOP question requiring careful inference
from facts\.
CRITICAL INFERENCE RULES:
1\. "Supporting X" \!= "Being X" \(e\.g\., supporting LGBTQ \!=
being LGBTQ member\)
2\. "No explicit mention" does NOT mean "No" \- be careful
with assumptions
3\. For "Would X be considered a member of\.\.\." \-\> look for
SELF\-identification only
4\. For "Would X be considered an ally\.\.\." \-\> supporting
others = being an ally
5\. Base answers ONLY on explicit statements in memories
For "Would X\.\.\." questions:
\- If clear evidence exists: "Yes" or "No" \+ brief reason
\- If inferring: "Likely yes" or "Likely no" \+ brief reason
\- Default to what the evidence actually shows
Question: \{question\}
Answer based on evidence:
### A\.6 Open\-Domain Question Prompt \(Category 4\)
For general knowledge questions requiring concise answers:
This is an OPEN\-DOMAIN question\.
RULES:
1\. Answer in 1\-5 words MAXIMUM
2\. Use EXACT terms from the top\-scored memory
3\. Do NOT add extra context or explanation
4\. No punctuation at the end
Question: \{question\}
Concise answer: