Belief Memory: Agent Memory Under Partial Observability


# Agent Memory Under Partial Observability
Source: [https://arxiv.org/html/2605.05583](https://arxiv.org/html/2605.05583)
Junfeng Liao¹, Qizhou Wang², Jianing Zhu³, Bo Du⁴, Rui Yan⁴, Xiuying Chen¹

¹MBZUAI  ²RIKEN AIP  ³UT Austin  ⁴Wuhan University

###### Abstract

LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring "API X failed" from temporary errors), even though such observations are inherently partial and potentially ambiguous. By committing to one conclusion and discarding uncertainty, these methods introduce *self-reinforcing error*: the agent acts on the stored conclusion, never revisits alternatives, and reinforces the conclusion over time. To address this issue, we propose *BeliefMem*, which shifts the memory paradigm from committing to a single conclusion per observation to retaining multiple candidate conclusions with their probabilities. Concretely, BeliefMem stores the candidate conclusions as separate memory entries, each carrying a probability that is updated via noisy-OR rules as new observations arrive. At retrieval, all candidates surface together with their probabilities, keeping alternatives visible to the agent. Since each conclusion in memory retains its probability, BeliefMem preserves the uncertainty that the deterministic paradigm discards, enabling the agent to act with high confidence on well-evidenced knowledge while retaining the capacity to update its confidence when new evidence arrives. Empirical evaluations on the LoCoMo and ALFWorld benchmarks show that, even with limited data, BeliefMem achieves the best average performance, markedly outperforming well-known baselines. More broadly, such probabilistic memory produces substantial gains and opens a new direction for agent memory in partially observable environments.

## 1 Introduction

*Large language model* (LLM) agents deployed in long-horizon, multi-session tasks increasingly rely on persistent external memory to accumulate knowledge across interactions (Hu et al., [2025b](https://arxiv.org/html/2605.05583#bib.bib13); Du, [2026](https://arxiv.org/html/2605.05583#bib.bib14)). *Factual memory* methods store observations about users and environments as structured entries, from natural-language memory streams (Park et al., [2023](https://arxiv.org/html/2605.05583#bib.bib2)) to vector-based extracted facts (Chhikara et al., [2025](https://arxiv.org/html/2605.05583#bib.bib12)). While these methods record what was observed, *self-improving memory* methods distill actionable lessons from past experience, from natural-language reflections (Shinn et al., [2023](https://arxiv.org/html/2605.05583#bib.bib1); Zhao et al., [2024](https://arxiv.org/html/2605.05583#bib.bib10)) to reusable skill libraries (Zhang et al., [2026a](https://arxiv.org/html/2605.05583#bib.bib19)). Despite this diversity, these methods share a common paradigm: every memory entry is stored as a single deterministic conclusion inferred from observations, and every operation over it produces an all-or-nothing outcome.

This deterministic paradigm results in errors that persist over time. Consider an agent that observes repeated API X timeouts (Figure [1](https://arxiv.org/html/2605.05583#S1.F1)): since each memory entry holds only a single categorical conclusion, the agent stores "API X failed" while the possibility of transient failure (e.g., temporary rate limiting) is permanently discarded. Self-improving methods amplify this problem by distilling experience such as "avoid API X," and even methods that update entries cannot escape it, as correcting the entry to "API X is operational" merely replaces one deterministic conclusion with another, and the next transient error flips it right back. Furthermore, when such flawed conclusions conflict with user instructions (e.g., "Use API X to …"), the agent struggles to act reliably (Hu et al., [2025a](https://arxiv.org/html/2605.05583#bib.bib37)). We refer to this issue as *self-reinforcing error*: the agent acts on stored conclusions, generating observations that further evidence them (Shao et al., [2025](https://arxiv.org/html/2605.05583#bib.bib20); Lam et al., [2026](https://arxiv.org/html/2605.05583#bib.bib21)).

Fundamentally, these agents operate in a *partially observable Markov decision process* (POMDP): they never directly access the true state of the world but only receive partial, noisy observations such as user messages and tool outputs (Kaelbling et al., [1998](https://arxiv.org/html/2605.05583#bib.bib15)). For instance, whether API X is permanently down or temporarily rate-limited is a hidden state that must be inferred from observations. Yet existing deterministic memory methods equate each observation with ground truth, leaving alternative hypotheses unrepresented and allowing self-reinforcing errors to persist across sessions (Figure [1](https://arxiv.org/html/2605.05583#S1.F1)).

![Refer to caption](https://arxiv.org/html/2605.05583v1/x1.png)

Figure 1: Deterministic memory vs. BeliefMem with an API timeout example. After repeated API X timeouts, the deterministic paradigm stores "API X failed" and avoids it in later sessions, reinforcing the error. In contrast, BeliefMem keeps multiple hypotheses (e.g., failure vs. rate limiting) with probabilities, retries the API, and updates beliefs with new evidence, enabling correction over time.

To bridge this gap, we propose BeliefMem, which fundamentally shifts the memory paradigm from storing deterministic conclusions to maintaining an attribute-level belief representation of the environment. Specifically, BeliefMem maintains active candidate conclusions for each piece of stored knowledge, assigning each conclusion a probability updated via noisy-OR evidence merge as new observations arrive. At retrieval, the candidate conclusions of each latent state surface with their probabilities, keeping competing hypotheses visible to the agent instead of reducing them to a single deterministic conclusion. This combination of belief-aware memory storage and probability-aware retrieval directly mitigates self-reinforcing error at its root: the alternative conclusions that the deterministic paradigm discards during the storage phase are now preserved and accessible to the agent. For example, in Figure [1](https://arxiv.org/html/2605.05583#S1.F1), repeated timeouts on API X keep alternative candidate conclusions viable alongside permanent failure. The agent can therefore revisit previously unfavorable actions in the future, and each new observation incrementally refines the probability assigned to each conclusion, strengthening well-supported conclusions and downweighting those with weak evidence.

To evaluate BeliefMem, we conduct experiments on the LoCoMo (Maharana et al., [2024](https://arxiv.org/html/2605.05583#bib.bib7)) and ALFWorld (Shridhar et al., [2020](https://arxiv.org/html/2605.05583#bib.bib30)) benchmarks, spanning long-term conversation and embodied agent interaction settings. Empirical evaluations show that our method achieves the best average performance on both benchmarks, outperforming existing memory methods even with a limited memory corpus size. Furthermore, ablation studies and adversarial experiments confirm the effectiveness of BeliefMem in preserving uncertainty and refining memories. More broadly, these results demonstrate that replacing deterministic memory entries with a probabilistic belief representation yields promising gains, opening a new direction for the agent memory paradigm in partially observable environments.

## 2 Related Work

### 2.1 Factual and RL-Based Memory

Factual and RL-based memory methods follow the deterministic paradigm, reducing each observation's candidate conclusions to a single categorical one and discarding the alternatives. Within this shared paradigm, early factual memory methods differ mainly in how they organize and access stored entries. Generative Agents (Park et al., [2023](https://arxiv.org/html/2605.05583#bib.bib2)) maintains a natural-language memory stream and retrieves memories with various signals, whereas MemGPT (Packer et al., [2023](https://arxiv.org/html/2605.05583#bib.bib4)) manages memories across context, recall, and storage through virtual context management. Subsequent work further improves extraction, organization, and retrieval without changing the underlying representation: Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.05583#bib.bib12)) dynamically extracts and consolidates salient facts for vector-based retrieval, and A-MEM (Xu et al., [2025](https://arxiv.org/html/2605.05583#bib.bib16)) organizes memories as structured notes with indexing and linking. Other work enriches the storage structure itself, with MemoryBank (Zhong et al., [2024](https://arxiv.org/html/2605.05583#bib.bib6)) updating retrieval strength with a forgetting curve, Zep (Rasmussen et al., [2025](https://arxiv.org/html/2605.05583#bib.bib22)) preserving evolving information in a temporal knowledge graph, and MemOS (Li et al., [2025](https://arxiv.org/html/2605.05583#bib.bib23)) unifying heterogeneous memory blocks within a single system. Meanwhile, RL-based memory methods replace this hand-crafted memory management with learnable policies to add, update, or delete entries, including Memory-R1 (Yan et al., [2025](https://arxiv.org/html/2605.05583#bib.bib11)), MEM1 (Zhou et al., [2025](https://arxiv.org/html/2605.05583#bib.bib24)), Agentic Memory (Yu et al., [2026](https://arxiv.org/html/2605.05583#bib.bib5)), and MemRL (Zhang et al., [2026b](https://arxiv.org/html/2605.05583#bib.bib18)). Across these studies, the main differences lie in storage management and retrieval strategy rather than in memory representation: each memory entry generally still records only one categorical conclusion inferred from noisy and ambiguous observations.

### 2.2 Self-improving Memory

Beyond recording factual observations, self-improving memory methods store actionable lessons distilled from past experience to guide the agent's subsequent actions. Several studies summarize raw experience into verbal lessons: Generative Agents (Park et al., [2023](https://arxiv.org/html/2605.05583#bib.bib2)) summarizes interaction history as reflective memory, Reflexion (Shinn et al., [2023](https://arxiv.org/html/2605.05583#bib.bib1)) generates self-corrective guidance from failed experiences, and ExpeL (Zhao et al., [2024](https://arxiv.org/html/2605.05583#bib.bib10)) aggregates recurring patterns across trajectories into reusable insights. Beyond verbal lessons, concurrent work records feasible actions in growing skill libraries: Voyager (Wang et al., [2023](https://arxiv.org/html/2605.05583#bib.bib3)) expands the library through an automatic curriculum as the agent explores new environments, and MemSkill (Zhang et al., [2026a](https://arxiv.org/html/2605.05583#bib.bib19)) constructs a set of skills that transfer reusable knowledge across related problems. Despite shifting from factual observations to distilled experience, these methods retain the same deterministic paradigm, storing each lesson as a single categorical entry while ignoring uncertainty in observations.

### 2.3 Belief State under Partial Observability

In the standard POMDP, uncertainty under partial observability is represented by a belief state, a probability distribution over hidden states conditioned on the observation history (Kaelbling et al., [1998](https://arxiv.org/html/2605.05583#bib.bib15)). Recent work views LLM agents as operating under partial observability and uses belief-based representations for action selection and coordination (Lidayan et al., [2025](https://arxiv.org/html/2605.05583#bib.bib25); Jiang et al., [2026](https://arxiv.org/html/2605.05583#bib.bib26); Wang et al., [2025](https://arxiv.org/html/2605.05583#bib.bib17)). Additionally, Belief Engine ([Yang et al.](https://arxiv.org/html/2605.05583#bib.bib27)) externalizes and updates beliefs in a specific multi-agent debate setting, and empirical work shows that a mismatch between an agent's beliefs and the true states of the environment can result in unreliable opinions and actions (Geng et al., [2025](https://arxiv.org/html/2605.05583#bib.bib29)). However, existing memory systems still ignore the key implication of such partial observability, namely that an agent's observations provide only partial evidence about hidden states (e.g., user preference) rather than direct access to the true states. As a result, memory is represented as deterministic conclusions inferred from noisy observations, collapsing their uncertainty into a single ground truth. This motivates a memory representation that preserves such uncertainty instead of storing each memory entry as ground truth.

## 3 Methodology

### 3.1 Problem Formulation

**POMDP (Partially Observable Markov Decision Process) setting.** We consider an agent interacting with partially observable environments. At decision time $t$, the agent has access to an observation $o_t \in \mathcal{O}$ and selects an action $a_t \in \mathcal{A}$. Let $s_t \in \mathcal{S}$ denote the latent environment state at time $t$; the environment transitions according to $s_{t+1} \sim T(\cdot \mid s_t, a_t)$ (Kaelbling et al., [1998](https://arxiv.org/html/2605.05583#bib.bib15)). Bayes-optimal action selection depends on the belief state, i.e., the posterior distribution over latent states induced by the interaction history. Defining $\eta_t := (o_{1:t}, a_{1:t-1})$, we write:

$$b_t(s) := \Pr(s_t = s \mid \eta_t), \qquad b_t \in \Delta(\mathcal{S}), \qquad a_t \sim \pi(\cdot \mid b_t). \tag{1}$$

Therefore, $b_t$ is a sufficient statistic of the action–observation history for action selection.
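The belief recursion behind Eq. (1) can be made concrete with a tiny discrete Bayes filter. The two-state world, transition probabilities, and observation likelihoods below are illustrative assumptions (not from the paper), echoing the API X example:

```python
# Illustrative two-state world: is API X permanently "down" or just "rate_limited"?
# All probabilities below are made-up values for demonstration.
STATES = ["down", "rate_limited"]
TRANS = {"down": {"down": 0.9, "rate_limited": 0.1},
         "rate_limited": {"down": 0.2, "rate_limited": 0.8}}
OBS_LIK = {"timeout": {"down": 0.95, "rate_limited": 0.7},
           "success": {"down": 0.05, "rate_limited": 0.3}}

def belief_update(belief, obs):
    """One discrete Bayes-filter step: predict through TRANS, correct by OBS_LIK."""
    pred = {s: sum(belief[s0] * TRANS[s0][s] for s0 in STATES) for s in STATES}
    unnorm = {s: OBS_LIK[obs][s] * pred[s] for s in STATES}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

belief = {s: 1.0 / len(STATES) for s in STATES}   # uniform prior
for o in ["timeout", "timeout", "success"]:
    belief = belief_update(belief, o)
# Both hypotheses keep nonzero mass: uncertainty is preserved, not collapsed.
```

Even after two timeouts, the belief retains mass on the rate-limiting hypothesis, which is exactly the information a deterministic write discards.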

**External Memory as Belief Approximation.** Existing memory methods can be viewed as approximating $b_t$ through an external memory module $M_t$, which compresses task-relevant information from past interactions into a retrievable structure. At time $t$, the agent queries $M_t$ with the current observation $o_t$ to obtain the memory context:

$$z_t = \mathrm{Read}(M_t, o_t), \tag{2}$$

and selects an action conditioned on both the observation and the retrieved context: $a_t \sim \pi(\cdot \mid o_t, z_t)$. After executing $a_t$ and observing $o_{t+1}$, the memory is updated as:

$$M_{t+1} = \mathrm{Update}(M_t, o_t, o_{t+1}), \tag{3}$$

where $\mathrm{Update}$ encompasses memory writing and management operations (Xu et al., [2025](https://arxiv.org/html/2605.05583#bib.bib16); Yan et al., [2025](https://arxiv.org/html/2605.05583#bib.bib11)). In this way, $M_t$ serves as a tractable approximation of the belief state, supporting future decisions without maintaining the inaccessible full posterior.
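Eqs. (2)–(3) define a generic Read/Update interface. A minimal sketch follows, with word overlap standing in for the semantic similarity a real system would compute with embeddings; all names here are hypothetical:

```python
class ExternalMemory:
    """Minimal sketch of the generic Read/Update interface of Eqs. (2)-(3).
    Entry structure and relevance scoring are illustrative placeholders."""

    def __init__(self):
        self.entries = []                 # list of (key, value) memory entries

    def read(self, observation, k=3):
        """z_t = Read(M_t, o_t): return the k entries most relevant to o_t."""
        scored = sorted(self.entries,
                        key=lambda e: self._sim(observation, e[0]),
                        reverse=True)
        return scored[:k]

    def update(self, obs, next_obs):
        """M_{t+1} = Update(M_t, o_t, o_{t+1}): write the new transition."""
        self.entries.append((obs, next_obs))

    @staticmethod
    def _sim(a, b):
        # Placeholder relevance: word overlap instead of embedding similarity.
        return len(set(a.split()) & set(b.split()))

mem = ExternalMemory()
mem.update("API X timeout", "retry failed")
mem.update("user prefers JSON output", "noted")
context = mem.read("API X status", k=1)   # most relevant entry for o_t
```

Deterministic methods differ mainly in how `read` scores entries and how `update` consolidates them; the next subsection shows why storing single conclusions in `entries` is the deeper problem.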

### 3.2 Motivation

**The Deterministic Bottleneck.** However, in practice, many existing memory methods store point estimates of task-relevant latent attributes, i.e., a deterministic conclusion for each attribute inferred from observations, thus discarding uncertainty that would be retained by a complete belief representation $b_t(s)$. Let $c$ denote a task-relevant attribute of the latent state (e.g., user preference, tool status, or an object–location relation), and let $\mathcal{H}(c) = \{h_1^{(c)}, \dots, h_{M_c}^{(c)}\}$ denote a set of mutually exclusive and collectively exhaustive hypotheses representing the possible conclusions for $c$. A reliable memory would maintain, for each $c$, a local posterior:

$$b_t^{(c)}(h) := \Pr(s_t \in h \mid o_{1:t}, a_{1:t-1}) = \sum_{s \in h} b_t(s), \qquad h \in \mathcal{H}(c). \tag{4}$$

However, in the deterministic memory paradigm, the write operation stores only a single conclusion $\hat{h}_t(c)$ rather than the full local posterior:

$$M_t = \{(c, \hat{h}_t(c)) : c \in \mathcal{C}_t\}, \qquad \hat{h}_t(c) \in \mathcal{H}(c), \tag{5}$$

where $\mathcal{C}_t$ denotes the set of all attributes preserved in $M_t$. Fundamentally, this corresponds to writing the most probable attribute-level hypothesis, $\hat{h}_t(c) \in \operatorname{argmax}_{h \in \mathcal{H}(c)} b_t^{(c)}(h)$, while discarding the remaining alternatives and their associated probabilities. Therefore, $M_t$ in current methods is a collection of attribute-level point estimates rather than a probabilistic approximation to the full belief state $b_t$, and the discarded uncertainty is no longer available for subsequent retrieval or update.
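The argmax write of Eq. (5), and the probability mass it throws away, can be seen in a few lines; the attribute name and numbers are made up for illustration:

```python
# Local posterior over hypotheses for attribute c = "API_X_status" (made-up numbers).
posterior = {"permanently_down": 0.55, "rate_limited": 0.40, "operational": 0.05}

# Deterministic paradigm (Eq. 5): write only the most probable hypothesis.
point_estimate = max(posterior, key=posterior.get)

# 45% of the probability mass is discarded at write time and cannot be
# reconstructed from memory alone.
discarded_mass = 1.0 - posterior[point_estimate]
```

Here a near-majority of the belief (the two alternatives combined) is erased by a single write, even though the winning hypothesis is barely more likely than the runner-up.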

**Self-Reinforcing Error.** This point estimate can induce self-reinforcing error. Suppose the retrieved memory $z_t$ in Eq. [2](https://arxiv.org/html/2605.05583#S3.E2) exposes a stored conclusion $(c, \hat{h}_t(c)) \in M_t$. The agent selects $a_t$ conditioned on $o_t$ and $z_t$, and the resulting transition $(o_t, a_t, o_{t+1})$ is written back to memory via Eq. [3](https://arxiv.org/html/2605.05583#S3.E3). Since memory retains no posterior support for the alternative hypotheses in $\mathcal{H}(c) \setminus \{\hat{h}_t(c)\}$, the agent is unlikely to select actions that would test these alternatives. If $\hat{h}_t(c)$ is incorrect or prematurely consolidated, the agent instead collects further evidence consistent with the flawed conclusion, reinforcing it over time. For example, if memory stores "API X failed," the agent becomes less likely to retry the API, thereby missing observations that could contradict the stored memory entry. Once uncertainty is collapsed to a point estimate, posterior support for discarded alternatives cannot be reconstructed from memory alone and must be re-established through entirely new evidence. This motivates a memory paradigm that retains a belief over the uncertainty rather than collapsing it to a point estimate.

### 3.3 Belief Memory

**Belief-based Memory Formulation.** To bridge this gap, we propose BeliefMem, which replaces the deterministic paradigm with an attribute-level belief representation that approximates the belief state $b_t^{(c)}$. We first introduce the ideal representation for each memory entry:

$$\big(c, \; b_t^{(c)}\big), \qquad b_t^{(c)} : \mathcal{H}(c) \to [0, 1], \quad \text{s.t.} \; \sum_{h \in \mathcal{H}(c)} b_t^{(c)}(h) = 1, \tag{6}$$

where $b_t^{(c)}$ denotes the distribution over all possible conclusions for attribute $c$ at time $t$. The ideal representation of memory in BeliefMem is therefore:

$$M_t = \big\{\big(c, \, b_t^{(c)}\big) : c \in \mathcal{C}_t\big\}, \tag{7}$$

which replaces the deterministic collection of point estimates in Eq. [5](https://arxiv.org/html/2605.05583#S3.E5) with a belief state that can represent the uncertainty of each attribute in the environment.

In practice, this idealized representation is not directly feasible, because the conclusion space associated with an attribute is open-ended or dynamically expanding, making exact posterior maintenance over the full set impractical. This leads to two practical challenges: i) in open-ended settings, $\mathcal{H}(c)$ is not fixed and may expand online as new candidate conclusions are generated, so a fully normalized distribution over all candidate conclusions is difficult to define; ii) even under a fixed set of possible conclusions, updating the distribution over all candidates after each new observation is computationally expensive. For this reason, classical POMDP methods rely on approximate belief representations, such as representative belief points, rather than exact updates across all possible states.

![Refer to caption](https://arxiv.org/html/2605.05583v1/x2.png)

Figure 2: Overview of BeliefMem. i) Upon receiving an observation, BeliefMem updates memories via Add (initializing candidates for new attributes) or Merge (incorporating new evidence via a noisy-OR update). ii) Retrieval scores entries by semantic similarity and temporal decay, returning a full belief rather than a single conclusion. iii) The agent acts conditioned on both the current observation and the retrieved belief, keeping all alternative hypotheses visible at decision time.

**Belief Update in Memory.** To overcome these challenges, BeliefMem uses two coupled mechanisms to practically approximate Eq. [6](https://arxiv.org/html/2605.05583#S3.E6). First, for each attribute $c$, BeliefMem stores only candidates that previous observations have actually evidenced, so the preserved conclusions grow with evidence rather than with $|\mathcal{H}(c)|$, and unseen candidates incur no storage or update cost. Specifically, for each observation $o_t$, the agent identifies the supported hypotheses to form the subset $\mathcal{H}_{\mathrm{sub}}(c) := \{h \in \mathcal{H}(c) \mid h \text{ is supported by } o_t\}$. The agent then assigns to each $h \in \mathcal{H}_{\mathrm{sub}}(c)$ a probability $p_t^{(c)}(h) \in [0, 1]$ measuring how strongly "$h$ is true" under $o_t$, and stores each resulting triple (attribute $c$, candidate $h$, probability $p_t^{(c)}(h)$) in the memory bank. Second, BeliefMem keeps per-candidate probabilities $\{p_t^{(c)}(h)\}$ instead of a normalized joint posterior over $\mathcal{H}_{\mathrm{sub}}(c)$, so that the magnitude assigned to a supported $h$ is not affected by the number of alternative candidates sharing the same $c$. These values are therefore evidence-based probabilities rather than posterior probabilities over mutually exclusive hypotheses. While stored independently, they are updated jointly whenever new evidence for $c$ is observed.

Building upon these principles, BeliefMem dynamically maintains $M_t$ via the following operations:

**Add** applies when the agent reports a new attribute $c' \notin \mathcal{C}_t$ and stores a new entry in $M_t$:

$$\big(c', \; h, \; p_{t+1}^{(c')}(h)\big), \qquad p_{t+1}^{(c')}(h) \in [p_{\min}, \, p_{\max}], \tag{8}$$

where $h \in \mathcal{H}_{\mathrm{sub}}(c')$ is supported by the current observation $o_{t+1}$, and $p_{\min}$ and $p_{\max}$ constrain the probability of each new conclusion. The details of extracting attributes are provided in Appendix [A.1](https://arxiv.org/html/2605.05583#A1.SS1).

**Merge** activates when the new observation supports an attribute $c \in \mathcal{C}_t$ already present in $M_t$. For any candidate conclusion $h$, if the observation provides supporting evidence, its belief is updated via a noisy-OR evidence merge:

$$p_{t+1}^{(c)}(h) = \min\Big(1 - \big(1 - p_t^{(c)}(h)\big)\big(1 - \Delta(o_{t+1}, h)\big), \; 0.99\Big), \tag{9}$$

where $\Delta(o_{t+1}, h) \in [0, 1]$ quantifies the strength of evidence provided by $o_{t+1}$ for the stored conclusion $h$ (details are provided in Appendix [A.1](https://arxiv.org/html/2605.05583#A1.SS1)). The upper bound of 0.99 prevents any candidate from being stored with certainty. After *Merge*, BeliefMem archives the old version $p_t^{(c)}$ for later retrieval. Additionally, if the observation instead supports a competing candidate for the same attribute $c$, the probability of $h$ is reduced to 0.25. Details are presented in Appendix [A.2](https://arxiv.org/html/2605.05583#A1.SS2).
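The Add and Merge operations can be sketched as follows. The bounds $p_{\min} = 0.3$ and $p_{\max} = 0.9$ are placeholder values (the paper defers its actual settings to the appendix), the 0.99 cap and the 0.25 competing-candidate reduction follow Eqs. (8)–(9) and the surrounding text, and the evidence strength Δ is passed in directly rather than estimated from an observation:

```python
P_MIN, P_MAX = 0.3, 0.9   # bounds for newly added conclusions (placeholder values)
CAP = 0.99                # no candidate may be stored with certainty
COMPETING_PENALTY = 0.25  # probability assigned to contradicted competitors

def add(memory, attr, hypothesis, p):
    """Add: initialize a new (attribute, candidate, probability) entry, Eq. (8)."""
    memory.setdefault(attr, {})[hypothesis] = min(max(p, P_MIN), P_MAX)

def merge(memory, attr, hypothesis, delta):
    """Merge: noisy-OR evidence update, Eq. (9); delta in [0, 1] is the
    evidence strength of the new observation for this hypothesis."""
    p_old = memory[attr].get(hypothesis, 0.0)
    memory[attr][hypothesis] = min(1 - (1 - p_old) * (1 - delta), CAP)
    # Competing candidates for the same attribute are down-weighted.
    for h in memory[attr]:
        if h != hypothesis:
            memory[attr][h] = COMPETING_PENALTY

mem = {}
add(mem, "API_X_status", "permanently_down", 0.6)
add(mem, "API_X_status", "rate_limited", 0.5)
merge(mem, "API_X_status", "rate_limited", 0.7)  # new evidence for rate limiting
# rate_limited: 1 - (1 - 0.5)(1 - 0.7) = 0.85; permanently_down drops to 0.25
```

Since a noisy-OR combination can never decrease a supported candidate's probability, the 0.99 cap and the explicit down-weighting of competitors are what keep beliefs revisable.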

**Belief-aware Retrieval.** Belief update alone is insufficient if retrieval discards the uncertainty that storage has carefully preserved. To close this gap, retrieval is redefined as an operation conditioned on the stored belief rather than on a single chosen conclusion. Specifically, given an observation $o_t$, the retrieval score of each entry is:

$$\alpha_t(c) = \mathrm{sim}(o_t, \, c) \cdot \lambda^{\tau_t(c)}, \qquad \lambda \in (0, 1], \tag{10}$$

where $\mathrm{sim}(\cdot) \in \mathbb{R}_{\geq 0}$ measures the relevance of $c$ to $o_t$ through semantic similarity (the specific choice of $\mathrm{sim}$ is given in Appendix [A.3](https://arxiv.org/html/2605.05583#A1.SS3)), $\lambda \in (0, 1]$ is a decay rate that controls temporal importance during retrieval, and $\tau_t(c) \in \mathbb{N}$ denotes the staleness of entry $c$ (i.e., the time elapsed since its last update). The staleness increases by one at each time step unless the corresponding entry $c$ is updated by *Add* or *Merge*, in which case it resets to 0. Thus, an entry's retrieval priority decays with staleness, while its underlying probability mass remains unchanged. $\mathrm{Read}$ then selects the top-$K$ entries by $\alpha_t(c)$ and returns:

$$r_t = \big\{\big(c, \, p_t^{(c)}\big) : c \in \mathrm{TopK}_{\alpha}(M_t, o_t)\big\}, \tag{11}$$

so that each retrieved attribute carries its candidate probabilities over $\mathcal{H}_{\mathrm{sub}}(c)$. The agent then selects an action as $a_t \sim \pi(\cdot \mid o_t, r_t)$, and every alternative conclusion in $\mathcal{H}_{\mathrm{sub}}(c)$ is accessible to the agent with its confidence, rather than being erased at storage time as in the deterministic paradigm.
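Retrieval per Eqs. (10)–(11) can be sketched in a few lines. Here λ = 0.9 and the similarity values are illustrative placeholders; the paper sets λ per benchmark and defines sim via an embedding model in its appendix:

```python
def retrieval_scores(staleness, obs_sim, lam=0.9):
    """Score each attribute c by sim(o_t, c) * lam ** tau_t(c), Eq. (10).
    staleness maps attribute -> tau_t(c); obs_sim maps attribute -> a
    nonnegative similarity (placeholder for a real embedding similarity)."""
    return {c: obs_sim[c] * lam ** tau for c, tau in staleness.items()}

def read_topk(memory, staleness, obs_sim, k=2, lam=0.9):
    """Read: return the top-k entries with their FULL candidate beliefs, Eq. (11)."""
    scores = retrieval_scores(staleness, obs_sim, lam)
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return {c: memory[c] for c in top}

memory = {"API_X_status": {"permanently_down": 0.25, "rate_limited": 0.85},
          "user_prefers_json": {"true": 0.7}}
staleness = {"API_X_status": 0, "user_prefers_json": 5}  # steps since last update
sims = {"API_X_status": 0.8, "user_prefers_json": 0.6}   # illustrative sim(o_t, c)
result = read_topk(memory, staleness, sims, k=1)
# The retrieved entry exposes all candidate probabilities, not a single conclusion.
```

Note that the decay only reorders retrieval priority; the stored probabilities themselves are untouched, matching the separation between retrieval score and belief mass described above.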

Overall, BeliefMem mitigates self-reinforcing error through two coupled principles: it preserves memory as an approximate belief representation, and it returns the candidate beliefs at retrieval, so that alternative hypotheses remain visible to the agent at decision time.

## 4 Experiments

Table 1: LoCoMo results across four categories under GPT-4o-mini and GPT-4o backbones. Each cell reports F1 / BLEU-1. The best and second-best numbers per column are in bold and underlined, respectively.

**Benchmarks.** We conduct experiments on two benchmarks to evaluate the long-term memory capabilities of BeliefMem in both long-term conversation and embodied agent interaction settings: i) *LoCoMo* (Maharana et al., [2024](https://arxiv.org/html/2605.05583#bib.bib7)), a long-term conversational memory benchmark whose dialogues contain roughly 9,000 tokens on average and up to 35 sessions, stressing multi-session retrieval and temporal reasoning. Following this benchmark, we evaluate along four question categories: *single-hop*, which asks the model to extract a specific fact from a single session; *multi-hop*, which requires composing information scattered across multiple sessions; *temporal reasoning*, which evaluates the ordering and duration of events along the dialogue timeline; and *open-domain*, which demands combining the contextual history with external commonsense knowledge. We report F1 for token-level precision and recall, and BLEU-1 for lexical overlap against ground-truth answers. ii) *ALFWorld* (Shridhar et al., [2020](https://arxiv.org/html/2605.05583#bib.bib30)), a text-based embodied benchmark whose tasks cover six household goal categories. The evaluation is split into an in-distribution *Seen* set and an out-of-distribution *Unseen* set whose room layouts and object instances are held out from training, so the latter directly probes memory transfer rather than pattern memorization. We report success rate (SR), the fraction of tasks whose goal condition is satisfied within a 50-step horizon, and the average number of steps on solved episodes, following Zhang et al. ([2026a](https://arxiv.org/html/2605.05583#bib.bib19)). More details on the evaluations are provided in Appendix [B.1](https://arxiv.org/html/2605.05583#A2.SS1).

**Baselines.** On LoCoMo, we compare BeliefMem with six well-known memory methods: the LoCoMo baseline from Maharana et al. ([2024](https://arxiv.org/html/2605.05583#bib.bib7)), ReadAgent (Lee et al., [2024](https://arxiv.org/html/2605.05583#bib.bib33)), MemoryBank (Zhong et al., [2024](https://arxiv.org/html/2605.05583#bib.bib6)), MemGPT (Packer et al., [2023](https://arxiv.org/html/2605.05583#bib.bib4)), A-MEM (Xu et al., [2025](https://arxiv.org/html/2605.05583#bib.bib16)), and Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.05583#bib.bib12)); ∗ denotes officially reported performance. On ALFWorld, we extend these baselines with LangMem (LangChain, [2025](https://arxiv.org/html/2605.05583#bib.bib35)) and MemoryOS (Kang et al., [2025](https://arxiv.org/html/2605.05583#bib.bib34)), and include a No-Memory baseline that chooses actions directly from the current observation to show the contribution of memory.

**Implementation Details.** On LoCoMo, we use text-embedding-3-small for embeddings and GPT-4o and GPT-4o-mini (Hurst et al., [2024](https://arxiv.org/html/2605.05583#bib.bib32)) as base models; on ALFWorld, we use Qwen3-Next-80B-A3B-Instruct (Yang et al., [2025](https://arxiv.org/html/2605.05583#bib.bib36)) as the base model. All baselines are run with their released configurations. For BeliefMem, $p_{\min}$ and $p_{\max}$ are shared across benchmarks, while the decay rate $\lambda$ is set per benchmark. The hyperparameter configurations of all methods are listed in Appendix [A.3](https://arxiv.org/html/2605.05583#A1.SS3).

### 4.1 Main results

Effectiveness in long conversational scenarios. As demonstrated in Table [1](https://arxiv.org/html/2605.05583#S4.T1), BeliefMem achieves the highest average performance across both base models, producing substantial improvements on multi-hop and temporal reasoning tasks. These tasks rigorously test an agent's ability to resolve observation conflicts and aggregate evidence over interactions. The effectiveness of BeliefMem arises from its dynamic belief update mechanism, which continuously refines and retains essential historical context while mitigating memory degradation. Furthermore, by archiving prior memory states with explicit temporal metadata, BeliefMem supports more precise retrieval of former environmental states, directly facilitating its superior temporal reasoning.

Superiority in embodied interactive scenarios. As detailed in Table [2](https://arxiv.org/html/2605.05583#S4.T2), BeliefMem consistently outperforms all baselines across seen and unseen tasks. Specifically, BeliefMem outperforms the second-best method (ReadAgent) by 11%, and exceeds the average of the remaining baselines by 99% overall. This advantage grows to 12.4% over the second-best baseline in unseen (out-of-distribution) scenarios, demonstrating BeliefMem's robust generalizability in realistic memory-driven agent scenarios. Crucially, BeliefMem achieves this superior performance using only half of the standard memory corpus. In Section [4.3](https://arxiv.org/html/2605.05583#S4.SS3), we provide a detailed analysis of this remarkable data efficiency, showing that only 16.67% of the memory corpus is sufficient to outperform 5 out of 6 baselines.

Table 2: ALFWorld results with Qwen3-Next-80B-A3B-Instruct on the in-distribution (Seen) split and the out-of-distribution (Unseen) split. SR (%): success rate (↑); #Steps: average steps on solved episodes (↓). $\Delta$ indicates the difference relative to the best result in each column. BeliefMem*: 50% of the memory corpus used. BeliefMem: full memory corpus used.

Table 3: Results of ablation studies on LoCoMo (GPT-4o-mini) and ALFWorld (Qwen3-Next-80B-A3B-Instruct). w/o memory: without belief-based memory; w/o retrieval: without belief-aware retrieval.
### 4.2 Ablation Studies

We conduct comprehensive ablation studies to investigate probabilistic memory, belief-aware retrieval, and the memory update operations (*Add* and *Merge*) on LoCoMo and ALFWorld (Table [3](https://arxiv.org/html/2605.05583#S4.T3)). As shown, replacing probabilistic memory with standard deterministic memory (w/o belief-based memory) results in clear performance drops on both benchmarks, highlighting the necessity of retaining uncertainty under partial observability. Removing belief-aware retrieval eliminates access to memory uncertainty, forcing the agent to discard candidate probabilities at retrieval and consequently degrading performance on both benchmarks. Furthermore, ablating the update mechanisms undermines BeliefMem's capabilities: removing the *Add* operation prevents the incorporation of new attributes of latent states into the memory bank, while removing *Merge* disables the probability updates over evidence for existing attributes. Without them, the memory bank of BeliefMem remains static, forfeiting the gains that come from dynamic memory updates. Overall, these results show that each part of our method is vital for achieving reliable memory under partial observability. The full results and detailed analyses are provided in Appendix [C.2](https://arxiv.org/html/2605.05583#A3.SS2).

### 4.3 Analysis and Discussion

BeliefMem scales robustly under limited memory data. Figure [4](https://arxiv.org/html/2605.05583#S4.F4) evaluates BeliefMem on ALFWorld with memory corpus sizes ranging from 500 to 3,000. When using just 50% of the corpus, BeliefMem already outperforms all baselines, and even with 500 samples, it still surpasses 5 out of 6 baselines. Beyond this advantage, we observe a trade-off within BeliefMem: using the full memory corpus produces the best performance on seen tasks, whereas a 50% subset results in superior generalization on unseen tasks. This trade-off arises because a richer set of in-distribution memories can bias the agent toward memorizing seen trajectories, at the cost of out-of-distribution generalizability. These analyses demonstrate that BeliefMem's probabilistic memory enables effective knowledge retention even when data is highly limited. Full results are in Appendix [C.1](https://arxiv.org/html/2605.05583#A3.SS1).

BeliefMem achieves reliable belief convergence. To validate whether BeliefMem enables candidate probabilities to converge to the ground truth, we report the Top-1 rate, defined as the proportion of instances where the true conclusion attains the highest confidence among all candidates. In Figure [4](https://arxiv.org/html/2605.05583#S4.F4), the Top-1 rate of BeliefMem steadily increases as evidence accumulates, with the true conclusion receiving the highest probability in 87.68% of cases. In contrast, a baseline using raw evidence frequency as confidence fails to converge reliably, as noisy observations distort the frequency of evidence. These results demonstrate that BeliefMem's memory update effectively filters noise and raises the confidence of true conclusions over time. Details are provided in Appendix [B.4](https://arxiv.org/html/2605.05583#A2.SS4).
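The Top-1 rate defined above is straightforward to compute. The sketch below is a minimal illustration, assuming a hypothetical data layout in which each case pairs a dict of candidate beliefs with its ground-truth conclusion (the function and variable names are ours, not the paper's):

```python
def top1_rate(cases):
    """Fraction of cases where the true conclusion carries the highest belief."""
    hits = sum(
        1 for candidates, truth in cases
        if max(candidates, key=candidates.get) == truth
    )
    return hits / len(cases)

# Toy cases: (candidate beliefs, ground-truth conclusion).
cases = [
    ({"key is in drawer": 0.9, "key is on table": 0.3}, "key is in drawer"),  # hit
    ({"API X failed": 0.6, "API X is flaky": 0.7}, "API X failed"),           # miss
]
rate = top1_rate(cases)  # one hit out of two cases -> 0.5
```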

![Refer to caption](https://arxiv.org/html/2605.05583v1/x3.png) Figure 3: BeliefMem vs. deterministic memory under an adversarial setting on ALFWorld.

BeliefMem shows strong memory correction in adversarial settings. We conduct adversarial experiments on the ALFWorld benchmark by injecting strongly flawed memory conclusions into the memory bank and observing the correction process (see Appendix [B.5](https://arxiv.org/html/2605.05583#A2.SS5) for the detailed pipeline). As shown in Figure [3](https://arxiv.org/html/2605.05583#S4.F3), after updates with valid and noisy observations, BeliefMem achieves a correction rate nearly twice that of the deterministic memory baseline. Furthermore, it achieves this correction notably faster, requiring an average of only 4.75 steps. These results highlight BeliefMem's robustness and stability when handling flawed memories under noisy observations.

Hyperparameter analysis. We evaluate the impact of the retrieval size $K$ and the decay rate $\lambda$ on BeliefMem (Table [7](https://arxiv.org/html/2605.05583#A3.T7) in Appendix [C](https://arxiv.org/html/2605.05583#A3)). Performance on ALFWorld scales positively with $K$ up to an optimum at $K=20$. Beyond this ($K=30$), BeliefMem faces a trade-off: although its SR on seen tasks reaches the best value, the SR on unseen tasks drops by 5.22%, as broader retrieval may surface noisy, in-distribution memories that hinder generalization. Additionally, variations in $\lambda$ affect the performance of BeliefMem, highlighting the critical role of the decay mechanism in controlling the agent's reliance on early memory and thereby striking a balance between efficient in-distribution exploitation and robust out-of-distribution generalization.

![Refer to caption](https://arxiv.org/html/2605.05583v1/x4.png) Figure 4: (a) BeliefMem maintains competitive performance across varying memory corpus sizes on ALFWorld, outperforming all baselines with only 50% of the memory corpus. (b) BeliefMem's candidate probabilities reliably converge to the true conclusion as evidence accumulates on LoCoMo, whereas naive frequency-based estimation fails to converge under noisy observations.

## 5 Conclusion

In this work, we identify a key drawback of prior memory methods in partially observable environments: their deterministic paradigm of storing categorical conclusions inferred from observations results in self\-reinforcing error\. To address this issue, we propose BeliefMem, which reframes memory as an approximation of the environment’s belief state\. Specifically, BeliefMem maintains multiple candidate conclusions with probabilities for each attribute of the evolving environment, updated via noisy\-OR evidence merge as new observations arrive\. During retrieval, these probabilistic conclusions enable the agent to reason under uncertainty and select reliable actions toward task goals\. Experiments on the LoCoMo and ALFWorld benchmarks show that our method outperforms well\-known baselines on average across diverse scenarios\. Additionally, various analyses in our work illustrate our method’s promising capabilities in memory correction and data efficiency\. Overall, our work introduces a novel perspective on agent memory in partially observable environments and demonstrates its empirical benefits under various settings\.

## References

- P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025). Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.
- P. Du (2026). Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers. arXiv preprint arXiv:2603.07670.
- J. Geng, H. Chen, R. Liu, M. H. Ribeiro, R. Willer, G. Neubig, and T. L. Griffiths (2025). Accumulating context changes the beliefs of language models. arXiv preprint arXiv:2511.01805.
- Y. Hu, Y. Wang, and J. McAuley (2025a). Evaluating memory in LLM agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257.
- Y. Hu, S. Liu, Y. Yue, G. Zhang, B. Liu, F. Zhu, J. Lin, H. Guo, S. Dou, Z. Xi, et al. (2025b). Memory in the age of AI agents. arXiv preprint arXiv:2512.13564.
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024). GPT-4o system card. arXiv preprint arXiv:2410.21276.
- H. Jiang, L. Ge, H. Cai, and R. Song (2026). PABU: Progress-aware belief update for efficient LLM agents. arXiv preprint arXiv:2602.09138.
- L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1-2), pp. 99–134.
- J. Kang, M. Ji, Z. Zhao, and T. Bai (2025). Memory OS of AI agent. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 25972–25981.
- C. Lam, J. Li, L. Zhang, and K. Zhao (2026). Governing evolving memory in LLM agents: Risks, mechanisms, and the Stability and Safety Governed Memory (SSGM) framework. arXiv preprint arXiv:2603.11768.
- LangChain (2025). LangMem. GitHub repository: [https://github.com/langchain-ai/langmem](https://github.com/langchain-ai/langmem).
- K. Lee, X. Chen, H. Furuta, J. Canny, and I. Fischer (2024). A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727.
- Z. Li, C. Xi, C. Li, D. Chen, B. Chen, S. Song, S. Niu, H. Wang, J. Yang, C. Tang, et al. (2025). MemOS: A memory OS for AI system. arXiv preprint arXiv:2507.03724.
- A. Lidayan, J. Bjorner, S. Golechha, K. Goyal, and A. Suhr (2025). ABBEL: LLM agents acting through belief bottlenecks expressed in language. arXiv preprint arXiv:2512.20111.
- A. Maharana, D. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang (2024). Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13851–13870.
- C. Packer, V. Fang, S. Patil, K. Lin, S. Wooders, and J. Gonzalez (2023). MemGPT: Towards LLMs as operating systems.
- J. S. Park, J. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023). Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22.
- P. Rasmussen, P. Paliychuk, T. Beauvais, J. Ryan, and D. Chalef (2025). Zep: A temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.
- S. Shao, Q. Ren, C. Qian, B. Wei, D. Guo, J. Yang, X. Song, L. Zhang, W. Zhang, D. Liu, et al. (2025). Your agent may misevolve: Emergent risks in self-evolving LLM agents. arXiv preprint arXiv:2509.26354.
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652.
- M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht (2020). ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023). Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- Z. Wang, S. He, D. Wu, J. Wang, L. Kang, J. Yu, and Z. Wang (2025). CoBel-World: Harnessing LLM reasoning to build a collaborative belief world for optimizing embodied multi-agent collaboration. arXiv preprint arXiv:2509.21981.
- W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang (2025). A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110.
- S. Yan, X. Yang, Z. Huang, E. Nie, Z. Ding, Z. Li, X. Ma, J. Bi, K. Kersting, J. Z. Pan, et al. (2025). Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- J. C. Yang, D. Dailisan, and M. Flechtner. Belief Engine: Bayesian memory for configurable opinion dynamics in LLM agents. In ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems.
- Y. Yu, L. Yao, Y. Xie, Q. Tan, J. Feng, Y. Li, and L. Wu (2026). Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885.
- H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang (2026a). MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474.
- S. Zhang, J. Wang, R. Zhou, J. Liao, Y. Feng, Z. Li, Y. Zheng, W. Zhang, Y. Wen, Z. Li, et al. (2026b). MemRL: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192.
- A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024). ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 19632–19642.
- W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang (2024). MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 19724–19731.
- Z. Zhou, A. Qu, Z. Wu, S. Kim, A. Prakash, D. Rus, J. Zhao, B. K. H. Low, and P. P. Liang (2025). Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841.

## Appendix A More Implementation Details of BeliefMem

### A.1 More Details about Memory Update

Given a new observation, the agent first extracts a set of candidate conclusions using the prompt shown in Figure [6](https://arxiv.org/html/2605.05583#A3.F6). Each candidate is represented as a structured memory object, including its normalized conclusion, semantic slots, evidence references, temporal information, and belief scores. In practice, an attribute $c$ is formed from stable semantic slots such as subject, predicate, entities, and qualifiers; a candidate conclusion $h$ is the normalized conclusion text/object for that attribute. Notably, the extracted prob field is used as the evidence strength $\Delta(o_{t+1}, h)$ in Eq. [9](https://arxiv.org/html/2605.05583#S3.E9), which measures how strongly the new observation supports the conclusion $h$. We use this value as an LLM-extracted confidence, not as a calibrated posterior probability. For *Add*, we clip this extracted value to $[p_{\min}, p_{\max}]$. Throughout the implementation, the stored prob values are confidence scores used for ranking and updating, not calibrated probabilities.

After extraction, BeliefMem updates the existing memory bank through several operations. If the candidate describes a new conclusion that is not covered by the current memory bank, BeliefMem applies *Add* and inserts it as a new memory entry. If the candidate provides compatible evidence for an existing conclusion (using keyword matching over attribute conclusions), BeliefMem uses *Merge*: the new evidence is attached to the existing memory, and the truth belief is updated with the noisy-OR evidence aggregation in Eq. [9](https://arxiv.org/html/2605.05583#S3.E9).
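The Add/Merge dispatch above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the standard noisy-OR form for Eq. 9 and the *Add* bounds reported in Appendix A.3, and all helper names are ours:

```python
P_MIN, P_MAX = 0.7, 0.9  # Add bounds [p_min, p_max] from Appendix A.3

def add(memory, attribute, conclusion, delta):
    """Add: insert a new candidate, clipping its initial belief to [P_MIN, P_MAX]."""
    memory.setdefault(attribute, {})[conclusion] = min(max(delta, P_MIN), P_MAX)

def merge(memory, attribute, conclusion, delta):
    """Merge: noisy-OR aggregation -- the conclusion stays unestablished only if
    both the prior belief and the new evidence fail to establish it."""
    p = memory[attribute][conclusion]
    memory[attribute][conclusion] = 1.0 - (1.0 - p) * (1.0 - delta)

def update(memory, attribute, conclusion, delta):
    """Dispatch an extracted candidate to Add (new conclusion) or Merge (existing)."""
    if conclusion in memory.get(attribute, {}):
        merge(memory, attribute, conclusion, delta)
    else:
        add(memory, attribute, conclusion, delta)

memory = {}
update(memory, "door_3", "locked", 0.95)  # Add: 0.95 is clipped to 0.9
update(memory, "door_3", "locked", 0.50)  # Merge: 1 - (1 - 0.9)*(1 - 0.5) = 0.95
```

Note that the noisy-OR update is monotone non-decreasing in the evidence, so confidence in a conclusion only falls through the contradiction mechanism of Appendix A.2.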

### A.2 Contradictory Memory

For any candidate conclusion $h$, if the observation $o_{t+1}$ provides evidence supporting a contradictory conclusion, the current belief of $h$ is reduced to $0.25$; we call this operation *Version*. The previous value is retained as a historical version. We use a rule-based criterion to identify contradictory conclusions: formally, let $(c, h)$ denote an existing memory conclusion for attribute $c$ and $(c, h')$ a newly extracted candidate from $o_{t+1}$ via the operation in Appendix [A.1](https://arxiv.org/html/2605.05583#A1.SS1). When $h \neq h'$ for the same attribute $c$, the new candidate is treated as contradicting $h$.
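A minimal sketch of this rule-based *Version* operation, assuming a simple dict-based memory layout (the function and variable names are hypothetical, not from the paper):

```python
def version(memory, history, attribute, new_conclusion, step):
    """Version: rule-based contradiction handling (sketch of Appendix A.2).

    For the same attribute c, any stored conclusion h != h' is treated as
    contradicted by the new candidate h': its current belief is archived as a
    historical version and reset to 0.25. The new candidate itself then goes
    through the usual Add/Merge path (not shown here).
    """
    for h, p in list(memory.get(attribute, {}).items()):
        if h != new_conclusion:
            history.setdefault((attribute, h), []).append((step, p))  # keep old version
            memory[attribute][h] = 0.25  # demote the contradicted belief

memory = {"door_3": {"locked": 0.9}}
history = {}
version(memory, history, "door_3", "open", step=7)
# "locked" is demoted to 0.25; its prior belief 0.9 survives as a historical version
```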

### A.3 Hyperparameter Configuration

BeliefMem uses the same *Add* bounds in Eq. [8](https://arxiv.org/html/2605.05583#S3.E8) across both benchmarks. The initial probability interval is $[p_{\min}, p_{\max}] = [0.7, 0.9]$. The decay rate $\lambda$ in Eq. [10](https://arxiv.org/html/2605.05583#S3.E10) is set to $0.5$ for LoCoMo and $0.1$ for ALFWorld.

For LoCoMo, we use random seed 20260413, with Top-$K=20$ for single-hop questions and Top-$K=30$ for multi-hop, temporal, and open-domain questions. For ALFWorld, we follow the official evaluation, which sets chunk size 512, query source objective, Contriever retrieval, memory Top-$K=20$, and a maximum of 50 environment steps. The action model is run with seed 42, temperature 0.0, top-$p=1.0$, and a maximum generation length of 32.

$\mathrm{sim}(\cdot)$ in Eq. [10](https://arxiv.org/html/2605.05583#S3.E10) employs a hybrid design. Specifically, it is computed as a linear combination of embedding cosine similarity and lexical overlap (over both attribute and evidence), with weights of 0.7 and 0.3, respectively, across all tasks. In addition, to reduce cost and latency in BeliefMem, we cap the number of candidate conclusions per attribute at 4 during retrieval.
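This hybrid score can be sketched as below. The paper does not specify the exact lexical-overlap measure, so Jaccard overlap over token sets is an assumption here, as are the function and argument names:

```python
import math

def hybrid_sim(query_vec, entry_vec, query_tokens, entry_tokens,
               w_embed=0.7, w_lex=0.3):
    """Hybrid similarity: 0.7 * embedding cosine + 0.3 * lexical overlap.

    Lexical overlap is taken as Jaccard similarity over token sets
    (an assumed choice; any overlap measure in [0, 1] would fit).
    """
    dot = sum(a * b for a, b in zip(query_vec, entry_vec))
    norm = (math.sqrt(sum(a * a for a in query_vec))
            * math.sqrt(sum(b * b for b in entry_vec)))
    cosine = dot / norm if norm else 0.0
    q, e = set(query_tokens), set(entry_tokens)
    jaccard = len(q & e) / len(q | e) if q | e else 0.0
    return w_embed * cosine + w_lex * jaccard

score = hybrid_sim([1.0, 0.0], [1.0, 0.0], ["alice", "paris"], ["paris", "lyon"])
# cosine = 1.0 and Jaccard = 1/3, so score = 0.7 * 1.0 + 0.3 * (1/3) = 0.8
```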

### A.4 Memory Costs

![Refer to caption](https://arxiv.org/html/2605.05583v1/x5.png) Figure 5: Average token consumption per generation of BeliefMem and competitive baselines on LoCoMo using GPT-4o-mini.

BeliefMem stores one candidate set for each active attribute and optionally preserves historical versions after *Merge*. If attribute $c$ has $M_c$ active candidates and $v_c$ retained versions, memory storage is $O(\sum_c M_c v_c)$ textual entries plus embeddings. Updating an observed attribute costs $O(M_c)$ after LLM extraction, since only candidates under the matched attribute are updated. Retrieval first scores attributes by semantic similarity and decay, then serializes the top-$K$ attributes and their active candidates, so token cost grows with the number of retrieved candidates rather than only with $K$.

To reduce this cost, we cap the number of retrieved candidates per attribute. Figure [5](https://arxiv.org/html/2605.05583#A1.F5) reports the average token consumption per generation on LoCoMo. Our method uses fewer tokens than the competitive baselines, confirming that the strategy effectively limits overhead.

### A.5 Hardware and Software

All base models and benchmarks used in this work are publicly accessible. All experiments were conducted on NVIDIA A800-80GB GPUs with Python 3.11 and PyTorch 2.4.1.

## Appendix B Further Experiment Setup

### B.1 ALFWorld Evaluation Details

#### Evaluation split\.

For all methods in Section [4.1](https://arxiv.org/html/2605.05583#S4.SS1), we conduct experiments on the official ALFWorld benchmark [Shridhar et al., [2020](https://arxiv.org/html/2605.05583#bib.bib30)], evaluating the full 140 episodes of the in-distribution *Seen* set and the full 134 episodes of the out-of-distribution *Unseen* set. Both splits cover the six household goal templates (Pick & Place, Examine in Light, Clean & Place, Heat & Place, Cool & Place, and Pick Two & Place), and every episode runs under the standard 50-step environment horizon.

#### Memory bank construction and retrieval\.

We follow the ALFWorld pipeline of Zhang et al. [[2026a](https://arxiv.org/html/2605.05583#bib.bib19)] for all memory methods in this paper. Expert trajectories are collected from the official training split, where each trajectory records the full sequence of observations, actions, and outcomes produced by the demonstrating agent, and are then grouped by task type. For every task type, a random subset of training trajectories is sampled as the experience corpus used for memory construction. Evaluation uses the official Seen/Unseen episodes described above, so no evaluation trace is used during memory construction. We fix the total bank size at 3,000 expert trajectories, distributed across the six task types, unless otherwise stated (e.g., for BeliefMem* and the data-size analysis). The memory bank is constructed once before evaluation begins, with each trajectory written through the native write operation of each baseline method. At test time, every method, including BeliefMem and all baselines, retrieves up to 20 memory items per observation (Top-$K=20$). The sampled corpus and evaluation episodes are kept identical across methods.

### B.2 LoCoMo Evaluation Details

For LoCoMo, we follow the official setting in Maharana et al. [[2024](https://arxiv.org/html/2605.05583#bib.bib7)]. All baseline methods are reproduced with the open-source settings described in their papers. For clarity, we additionally present the performance of Mem0 and A-MEM as reported in their papers in Section [4.1](https://arxiv.org/html/2605.05583#S4.SS1); these numbers are denoted with ∗.

### B.3 Preserving Historical Beliefs

Instead of overwriting an existing belief $p_t$ during *Merge*, BeliefMem retains $p_t^{(c)}$ as an independent historical entry alongside the newly updated current version $p_{t+1}$. This update mechanism is essential for handling temporal queries. While recent observations naturally update the agent's current belief about the environment, queries targeting specific past contexts necessitate access to historical states.

This temporal awareness is achieved through timestamp management. The updated entry $p_{t+1}$ receives the latest timestamp, while the old entry $p_t$ retains its original timestamp. Following the decay mechanism in Eq. [10](https://arxiv.org/html/2605.05583#S3.E10), the current version is naturally prioritized during default retrieval due to its recency. Simultaneously, the historical version remains fully accessible for queries explicitly referencing earlier time steps, ensuring comprehensive temporal grounding without information loss.
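The recency-based prioritization can be illustrated with an exponential decay over entry age; the exact decay form of Eq. 10 is an assumption in this sketch, and the entry layout is hypothetical:

```python
import math

def retrieval_score(sim, age, lam):
    """Recency-weighted score: similarity damped by exponential decay in entry age
    (the precise decay form of Eq. 10 is assumed, not quoted from the paper)."""
    return sim * math.exp(-lam * age)

# One attribute with a current version (t=5) and a historical version (t=1),
# both equally similar to the query.
entries = [
    {"conclusion": "Alice lives in Paris", "timestamp": 5, "sim": 0.8},
    {"conclusion": "Alice lives in Lyon",  "timestamp": 1, "sim": 0.8},
]
now, lam = 6, 0.5
ranked = sorted(entries,
                key=lambda e: retrieval_score(e["sim"], now - e["timestamp"], lam),
                reverse=True)
# Default retrieval ranks the recent version above the historical one, while the
# historical entry stays in the bank for queries about earlier time steps.
```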

### B.4 Belief Convergence Analysis Experiment Setup

To ensure reproducibility and clarity, we detail the setup of the belief convergence analysis in Section [4.3](https://arxiv.org/html/2605.05583#S4.SS3). We conduct the evaluation on the multi-hop task of the LoCoMo benchmark, keeping all hyperparameters at their defaults. Specifically, we take the gold-standard answers of 211 selected samples that can each be mapped to a single attribute-level conclusion as the target true states. The observations are sampled from the memory corpus associated with these questions, ensuring they contain the evidence needed to support the ground-truth conclusions. Under this configuration, we validate the capacity of BeliefMem to make memory beliefs converge toward the true conclusion.

### B.5 Detailed Pipeline for Adversarial Memory Correction

To provide further clarity on the adversarial experiments discussed in Section [4.3](https://arxiv.org/html/2605.05583#S4.SS3), this section details the complete experimental pipeline, including adversarial sample construction, update procedures, and evaluation metrics.

Experimental Setup. The correction process is evaluated through the following steps:

- Flawed Memory Injection: We scan the BeliefMem memory bank evaluated on the ALFWorld benchmark to identify strongly flawed conclusions. A memory entry is selected as an adversarial sample if it meets three criteria: (1) it contradicts the optimal action, (2) it is highly ranked (retrieved in the Top-$K$), and (3) the correct conclusion is entirely excluded from the Top-$K$. This strict filtering yields 102 adversarial samples.
- Observation Generation: For each sample, we construct a sequence of observations to simulate the update process. We generate 5 valid observations based on the correct actions, providing sparse, ground-truth hints. Simultaneously, we construct 5 noisy observations derived from incorrect candidate conclusions (strictly excluding the correct one) to serve as adversarial perturbations during the memory update phase.
- Update Protocol: BeliefMem is updated using the default settings specified in Appendix [A.3](https://arxiv.org/html/2605.05583#A1.SS3). As a baseline, the deterministic memory method stores and updates only a single conclusion per sample, as autonomously determined by the agent. We run 10 update steps, each randomly presenting either a valid or a noisy observation.

Evaluation Metrics. We assess memory correction performance using two primary metrics:

- Correction Rate: The proportion of samples in which the correct conclusion successfully outranks the injected flawed conclusion during retrieval after the update process.
- Correction Steps: The average number of update steps required for the correct conclusion to achieve a stably higher retrieval ranking than the flawed conclusion.
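These two metrics can be computed as below, under the hypothetical assumption that each adversarial sample records the step at which the correct conclusion first stably outranked the flawed one, or `None` if it never did (the data layout and function name are ours):

```python
def correction_metrics(runs):
    """Correction rate and average correction steps over adversarial samples.

    Each element of `runs` is the step index at which the correct conclusion
    first stably outranked the injected flawed one, or None if it never did.
    """
    corrected = [s for s in runs if s is not None]
    rate = len(corrected) / len(runs)
    avg_steps = sum(corrected) / len(corrected) if corrected else float("inf")
    return rate, avg_steps

# Toy run: three of four samples corrected, at steps 3, 5, and 6.
rate, steps = correction_metrics([3, 5, None, 6])
# rate = 0.75; steps = 14/3
```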

## Appendix C Further Empirical Results

### C.1 Full results of BeliefMem on ALFWorld with different memory corpus sizes

In this section, we provide the full results behind Figure [4](https://arxiv.org/html/2605.05583#S4.F4), shown in Table [4](https://arxiv.org/html/2605.05583#A3.T4). As detailed there, we observe a generalization trade-off related to memory corpus size. Specifically, BeliefMem achieves its highest out-of-distribution (ALF-Unseen) success rate of 61.19% and its best average performance of 59.88% using only 1,500 samples, exactly 50% of the sampled memory corpus. At this size, the agent is also most efficient in novel environments, requiring only 29.34 steps on average to complete tasks.

Conversely, scaling the memory corpus to the full 3,000 samples maximizes in-distribution (ALF-Seen) performance, reaching a peak success rate of 63.57% with the fewest interaction steps (27.49). However, this data increase results in a sharp 7.44% decline in unseen success rates compared to the 1,500-sample setting. This divergence suggests that, in this setting, excessive environment-specific data may induce overfitting to the seen environments, biasing the agent toward memorizing seen trajectories at the expense of generalizability. We treat this explanation as plausible rather than conclusive, since no additional controlled test of this hypothesis was performed. Furthermore, BeliefMem demonstrates strong low-data robustness: with merely 500 samples (16.67% of the data), it maintains an average success rate of 50.38%. Overall, these results show that BeliefMem efficiently distills actionable, generalizable memories from highly limited interactions, whereas simply increasing the memory corpus does not monotonically improve robustness.

Table 4:The performance of BeliefMem on the ALFWorld dataset with varying corpus sizes\.
### C.2 Full Results of Ablation Studies

The complete results of the ablations on ALFWorld and LoCoMo are provided in Tables [5](https://arxiv.org/html/2605.05583#A3.T5) and [6](https://arxiv.org/html/2605.05583#A3.T6), respectively.

#### w/o belief\-based memory\.

In this setting, we collapse the probabilistic memory to a deterministic one: for each attribute, only the single most likely conclusion is kept and retrievable. As shown, the ALFWorld success rate drops from 59.88 to 28.71, and the average F1 on LoCoMo falls from 42.38 to 22.58. Without a belief representation over conclusions, the agent acts on overconfident, often incorrect memories and loses the ability to reason under partial observability.

#### w/o belief\-aware retrieval\.

All candidate conclusions for an attribute are still stored, but their probabilities are discarded during retrieval, making them appear equally likely. The resulting performance drop is more moderate: the ALFWorld success rate declines to 51.77 (8.11 absolute points below full BeliefMem) and the LoCoMo average F1 decreases to 28.50. This indicates that merely retaining multiple hypotheses already preserves a useful degree of uncertainty. However, the picture changes on the more challenging LoCoMo sub-tasks: on multi-hop and open-domain questions, F1 decreases from 40.51 to 27.12 and from 28.73 to 15.89, respectively. In these settings, the agent must weigh conflicting evidence, and without probabilities it cannot adjudicate between competing claims, leading to ambiguous retrieval.

#### w/o *Add*.

When *Add* is entirely removed, no new memories are inferred from observations, which disables the dynamic memory update in our method. As a result, the ALFWorld success rate collapses to 22.58%, and the LoCoMo F1 drops to 14.48%, showing that correct attribution of new evidence is a crucial condition for memory to remain organized and usable.

#### w/o *Merge*.

Removing *Merge* means that every new observation creates a separate attribute entry rather than updating an existing one with accumulated evidence. Consequently, probabilities are never refined by subsequent observations and remain frozen at their initial values. The ALFWorld success rate falls to 40.81% and the LoCoMo F1 drops to 20.38%, as the memory stays static and cannot integrate sequential information.
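The interplay between *Add* and *Merge* can be sketched as follows. This is a minimal illustration based on the paper's high-level description of Noisy-OR aggregation; the `BeliefEntry` structure, function names, and keying scheme are our own assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class BeliefEntry:
    """One candidate conclusion for an attribute, with its current probability."""
    attribute: str
    conclusion: str
    prob: float

def noisy_or_update(prior: float, delta: float) -> float:
    """Noisy-OR aggregation: the conclusion remains unestablished only if
    every piece of evidence independently fails to establish it."""
    return 1.0 - (1.0 - prior) * (1.0 - delta)

def merge(memory: dict, attribute: str, conclusion: str, delta: float) -> None:
    """Fold a new observation into memory: refine the matching entry's
    probability via Noisy-OR (the Merge path), or create a fresh candidate
    entry with the observation's evidence strength (the Add path)."""
    key = (attribute, conclusion)
    if key in memory:
        memory[key].prob = noisy_or_update(memory[key].prob, delta)
    else:
        memory[key] = BeliefEntry(attribute, conclusion, delta)

mem: dict = {}
merge(mem, "API X status", "failing", 0.4)  # first, weak observation
merge(mem, "API X status", "failing", 0.4)  # corroborating observation
print(round(mem[("API X status", "failing")].prob, 2))  # 1 - 0.6*0.6 = 0.64
```

Under this sketch, ablating *Merge* corresponds to always taking the `else` branch, so probabilities stay frozen at their initial `delta`, matching the degradation described above.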

Table 5: Full results of ablation studies on the ALFWorld benchmark (memory corpus size = 1,500).
Table 6: Full results of ablation studies on the LoCoMo benchmark using GPT-4o-mini.

### C.3 Results of Hyperparameter Analysis

**Top-K analysis.** Table [7](https://arxiv.org/html/2605.05583#A3.T7) presents a sensitivity analysis of the retrieval size K on ALFWorld, evaluating its impact on both task success rate (SR) and interaction efficiency (Steps). The results show a clear non-linear relationship between memory retrieval scale and the agent's generalization, identifying K = 20 as the optimal setting for BeliefMem. At K = 20, the model achieves the best generalization and overall efficiency, reaching an average SR of 59.88% while minimizing the average trajectory length to 29.55 steps. When the retrieval size is overly restricted (K ≤ 10), the agent degrades across all metrics, indicating that an insufficient K fails to retrieve adequate contextual memory. Conversely, expanding the retrieval size to K = 30 exposes a generalization trade-off. While the larger memory context maximizes SR on in-distribution tasks (Seen SR peaks at 61.43%), it compromises out-of-distribution reasoning: the Unseen SR drops by 5.22 absolute points (from 61.19% to 55.97%, an 8.5% relative degradation), with a corresponding loss of execution efficiency (Unseen steps increase to 32.43). This divergence suggests that excessive retrieval surfaces redundant, task-specific memories from seen environments; rather than augmenting the belief state, these extraneous in-distribution memories act as noise that impairs the agent's generalization in unseen environments.
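A belief-aware top-K retrieval of the kind analyzed above can be sketched as follows. The scoring rule (query relevance multiplied by the entry's belief probability) and all names here are illustrative assumptions; the paper does not specify the exact ranking function.

```python
import heapq

def retrieve_top_k(entries, query_sim, k=20):
    """Return the k entries with the highest combined score. Multiplying
    query relevance by belief probability surfaces well-evidenced
    conclusions first while keeping low-probability alternatives
    retrievable, so uncertainty stays visible to the agent."""
    scored = [(query_sim(e) * e["prob"], e) for e in entries]
    return [e for _, e in heapq.nlargest(k, scored, key=lambda t: t[0])]

entries = [
    {"conclusion": "API X is failing",      "prob": 0.90, "sim": 0.8},
    {"conclusion": "API X is rate-limited", "prob": 0.30, "sim": 0.8},
    {"conclusion": "DB schema v2 deployed", "prob": 0.95, "sim": 0.1},
]
top = retrieve_top_k(entries, query_sim=lambda e: e["sim"], k=2)
print([e["conclusion"] for e in top])
# Both candidate conclusions about API X surface together, with the
# better-evidenced one ranked first.
```

This also makes the K trade-off concrete: a small K truncates the candidate set before alternatives surface, while a large K admits low-scoring, task-specific entries from seen environments.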

**Decay rate λ.** To investigate the impact of the decay rate λ on both task efficacy and generalization, we conduct a sensitivity analysis on ALFWorld. As shown in Table [7](https://arxiv.org/html/2605.05583#A3.T7), removing this term (w/o decay) yields the highest in-distribution SR (63.57%) but the poorest out-of-distribution performance (55.97% unseen SR) with the longest unseen trajectories, indicating strong reliance on earlier memories from seen environments. Setting λ = 0.1 drastically shifts this dynamic, achieving the best unseen success rate (61.19%) while sacrificing seen performance. As λ increases toward 0.9, BeliefMem steadily recovers in seen environments while maintaining robust unseen generalization, achieving one of the highest average success rates. Furthermore, the step metrics reveal that higher values of λ (≥ 0.9) consistently induce more efficient decision-making: the agent takes fewer steps, driving the average trajectory length down to its minimum of 29.00 steps at λ = 1.0, as it can retrieve more related early memories. Consequently, the choice of λ explicitly governs the agent's reliance on its historical memory: increasing λ encourages the recall of past experiences for optimal in-distribution execution, whereas decreasing λ keeps the model from being constrained by prior patterns, benefiting out-of-distribution performance.
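One way to realize the behavior described above is to treat λ as a per-step retention factor on retrieval scores, so that λ = 1.0 preserves old memories at full weight while small λ discounts them in favor of recent evidence. This reading is our assumption inferred from the reported trends (the paper does not give the decay formula), and the function below is a hypothetical sketch under it.

```python
def decayed_score(base_score: float, age: int, lam: float) -> float:
    """Recency-weighted retrieval score: lam in (0, 1] is a per-step
    retention factor. lam = 1.0 leaves memories of any age at full
    weight; small lam sharply discounts older entries."""
    return base_score * (lam ** age)

# An old, well-matched memory vs. a fresh, weaker match.
old = decayed_score(0.9, age=10, lam=0.1)   # heavily discounted
fresh = decayed_score(0.6, age=0, lam=0.1)  # dominates retrieval
print(fresh > old)

# With lam = 1.0 the ranking reverts to raw relevance, favoring the
# older but better-matched memory.
print(decayed_score(0.9, 10, 1.0) > decayed_score(0.6, 0, 1.0))
```

Under this sketch, small λ suppresses seen-environment patterns (helping unseen generalization), while λ near 1.0 restores full recall of past experience, consistent with the trade-off reported in Table 7.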

Table 7: Sensitivity analysis of hyperparameters (Top-K and λ) on ALFWorld.
Figure 6: The prompt used for attribute extraction. It restricts the model's output to fact-based JSON objects grounded in the provided conversation.

## Appendix D Limitations and Future Work

While BeliefMem successfully shifts the memory paradigm from storing deterministic conclusions to maintaining a belief representation of the underlying true states, achieving promising performance across diverse scenarios, several limitations remain to be addressed in future work:

- **Lack of theoretical guarantees for belief approximation.** BeliefMem maintains the probability of each candidate conclusion via Noisy-OR evidence aggregation rather than a complete, normalized posterior distribution, because exact belief maintenance over an open-ended hypothesis space is computationally infeasible. Although this approximation provides no formal convergence guarantees, the experimental results in Figure [4](https://arxiv.org/html/2605.05583#S4.F4) show that, as evidence accumulates, the candidate probabilities reliably converge toward the true conclusion, demonstrating that the approximation is effective in practice.
- **LLM-extracted evidence strength.** The evidence strength Δ used in the Noisy-OR update is extracted by LLMs rather than derived from a calibrated observation-likelihood model, which can introduce noise when the model's confidence estimates are inaccurate. However, the adversarial correction experiments in Section [4.3](https://arxiv.org/html/2605.05583#S4.SS3) indicate that BeliefMem is robust to noisy observations, achieving a correction rate for flawed memory entries nearly twice that of the deterministic baseline.
- **Computational overhead.** Although we leverage an approximated belief representation, the computational cost remains non-trivial compared to standard deterministic memory baselines, especially during memory writing and merging. Given that our method already uses fewer tokens than competitive baselines (Table [A.4](https://arxiv.org/html/2605.05583#A1.SS4)), exploring more cost-effective architectures for maintaining and updating belief-based memory is a promising direction for future work.
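The first limitation can be made concrete with a small numerical check: because each candidate's probability is updated by Noisy-OR independently, the probabilities of mutually exclusive conclusions need not sum to one, so they are evidence scores rather than a normalized posterior. The numbers below are illustrative, not taken from the paper.

```python
def noisy_or(prior: float, deltas: list) -> float:
    """Fold a sequence of evidence strengths into a probability
    via repeated Noisy-OR updates."""
    p = prior
    for d in deltas:
        p = 1.0 - (1.0 - p) * (1.0 - d)
    return p

# Two mutually exclusive candidate conclusions for the same attribute,
# each updated independently from its own observations.
p_a = noisy_or(0.0, [0.6, 0.5])  # 0.8
p_b = noisy_or(0.0, [0.5])       # 0.5
print(p_a + p_b)  # 1.3 > 1: independent scores, not a normalized posterior
```

This is exactly the approximation the limitation names: tractable per-candidate updates in exchange for forgoing a globally coherent distribution.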

## Appendix E LLM Usage Statement

In this paper, we employed the commercial large language model GPT‑5\-Chat for language refinement and manuscript polishing\. It was not used for generating research ideas, designing methods, or conducting a literature search and discovery\.
