Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty


Summary

This paper introduces Agent-BRACE, a method that decouples LLM agents into belief state and policy models to handle long-horizon tasks in partially observable environments. By verbalizing state uncertainty, it achieves significant performance improvements over baselines while maintaining constant context window size.


# Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Source: [https://arxiv.org/html/2605.11436](https://arxiv.org/html/2605.11436)
Joykirat Singh¹, Zaid Khan¹, Archiki Prasad¹, Justin Chih-Yao Chen¹, Akshay Nambi³, Hyunji Lee¹, Elias Stengel-Eskin², Mohit Bansal¹

¹UNC Chapel Hill ²The University of Texas at Austin ³Microsoft Research

###### Abstract

Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: *partial observability* requires maintaining uncertainty over unobserved world attributes, and *long interaction history* causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a *belief state*: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a *belief state model* and a *policy model*, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates. Codebase: [https://github.com/joykirat18/Agent-BRACE](https://github.com/joykirat18/Agent-BRACE)

## 1 Introduction

Large language models (LLMs) are increasingly being deployed as agents in long-horizon, partially observable tasks like software engineering (Yang et al., [2024](https://arxiv.org/html/2605.11436#bib.bib38); Jimenez et al., [2024](https://arxiv.org/html/2605.11436#bib.bib39)), web navigation (Zhou et al., [2023](https://arxiv.org/html/2605.11436#bib.bib40); Deng et al., [2023](https://arxiv.org/html/2605.11436#bib.bib42); He et al., [2024](https://arxiv.org/html/2605.11436#bib.bib58)), or research (Lu et al., [2024](https://arxiv.org/html/2605.11436#bib.bib41); Novikov et al., [2025](https://arxiv.org/html/2605.11436#bib.bib62)). These models must act while inferring a complex world state from incomplete observations over many steps – a setting traditionally modeled as a partially observable Markov decision process (POMDP) (Åström, [1965](https://arxiv.org/html/2605.11436#bib.bib10)). In this framing, an optimal policy needs only to condition on the *belief state*, a posterior distribution over possible environment states given the history of past/current observations and past actions. The belief state admits two complementary interpretations: (1) the distribution represents uncertainty over the state the agent is in, accounting for unobserved variables; (2) it serves as a sufficient statistic for the prior interaction history $\mathcal{H}_t$, allowing the agent to track observations over time.

Current LLM agents differ from traditional POMDP approaches in that they generally represent both actions and observations in text. This enables interaction with open-ended, unstructured environments that lack predefined action or observation schemas, but complicates encoding an explicit belief state and introduces its own challenges. First, without a sufficient statistic of history, LLM-based policies must be conditioned on the raw interaction trajectory, leading to an inefficient representation ([Fig. 1](https://arxiv.org/html/2605.11436#S1.F1); Raw history), with the context length growing linearly in the episode length, thus increasing computational cost and diluting task-relevant signals with spurious details (Liu et al., [2024](https://arxiv.org/html/2605.11436#bib.bib50); Chung et al., [2025](https://arxiv.org/html/2605.11436#bib.bib51)) ([Fig. 1](https://arxiv.org/html/2605.11436#S1.F1); Context Length vs. Accuracy). Second, while POMDP approaches for large or continuous state spaces are well studied (e.g., particle filters, predictive state representations) (Silver and Veness, [2010](https://arxiv.org/html/2605.11436#bib.bib33); Hafner et al., [2020](https://arxiv.org/html/2605.11436#bib.bib31); Gregor et al., [2019](https://arxiv.org/html/2605.11436#bib.bib32)), the open-ended nature of text poses its own challenges: it is unclear how to encode a distribution in text over a compositional state space. Indeed, past work either relies on the LLM's internal representation as a belief proxy (Kamel et al., [2025](https://arxiv.org/html/2605.11436#bib.bib61)) – which lacks interpretability and limits external verification – or externalizes belief into a free-form natural language summary (Zhou et al., [2025](https://arxiv.org/html/2605.11436#bib.bib4); Yu et al., [2025](https://arxiv.org/html/2605.11436#bib.bib15)), which is more interpretable but collapses the belief distribution $b_t(s)$ into a single point estimate.

![Refer to caption](https://arxiv.org/html/2605.11436v1/x1.png)

Figure 1: Three approaches to context management in long-horizon POMDP agents. *Raw history* (left) grows unbounded as $\mathcal{O}(n)$. *Summary-based belief* (middle) compresses history into a summary but collapses the belief distribution to a single MLE point estimate $\hat{s}\sim b_t(s)$, discarding uncertainty. *Agent-BRACE* (right) represents the belief as WEP-annotated statements (confirmed, probable, possible, etc.), approximating the full distribution $b_t(s)$ with a near-constant context window. Agent-BRACE (Qwen3-4B-Instruct) outperforms both baselines in accuracy while maintaining constant context length (right panel).

To tackle these challenges and preserve uncertainty in belief states for LLM agents, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a training method that represents an agent's belief as text while simultaneously encoding uncertainty via verbalized probability estimates. Following the POMDP formalization, Agent-BRACE decouples an LLM agent into two modules: a *belief state model* and a *policy model*, training them jointly using reinforcement learning (RL). As shown in [Fig. 2](https://arxiv.org/html/2605.11436#S1.F2) (Belief State Update), at each step $t$ the belief state model takes as input the goal ($G$), the previous belief ($b_t$), and the new observation ($o_{t+1}$), and produces an updated approximate belief ($b_{t+1}$) represented as a set of atomic natural language claims. Crucially, each claim is annotated with a certainty label drawn from the Words of Estimative Probability (WEP) scale (van Tiel et al., [2022](https://arxiv.org/html/2605.11436#bib.bib12); Tang et al., [2026](https://arxiv.org/html/2605.11436#bib.bib53); Sileo and Moens, [2023](https://arxiv.org/html/2605.11436#bib.bib54)): an ordered Likert-style vocabulary (confirmed ≻ almost certain ≻ probable ≻ possible ≻ unlikely ≻ doubtful ≻ unknown) grounded in how humans express uncertainty in natural language. Prior work has shown LLMs can meaningfully produce and differentiate between such verbalized uncertainty expressions (Lin et al., [2022](https://arxiv.org/html/2605.11436#bib.bib44); Tian et al., [2023](https://arxiv.org/html/2605.11436#bib.bib43); Stengel-Eskin et al., [2024](https://arxiv.org/html/2605.11436#bib.bib45)). This yields a belief approximation that captures uncertainty and uses a discrete scale that LLMs can reliably produce and update. Since the belief state $b_t$ is a sufficient approximation of the full history, the policy model can select an action conditioned on $(G, b_t, o_t)$ rather than on the history $\mathcal{H}_t$ – replacing an ever-growing trajectory with a compact, bounded representation. In Agent-BRACE, the belief state model and the policy model are jointly trained via PPO (Schulman et al., [2017](https://arxiv.org/html/2605.11436#bib.bib19)). The policy model is optimized to maximize the binary environment reward (+1 for success, 0 for failure), providing the primary signal for action selection. The belief state model is optimized via a combination of complementary rewards, each targeting a different failure mode in belief quality: state tracking ensures logical consistency (Zou et al., [2026](https://arxiv.org/html/2605.11436#bib.bib47); Yuan et al., [2026](https://arxiv.org/html/2605.11436#bib.bib48)), state correctness ensures factual grounding (Zhao et al., [2026](https://arxiv.org/html/2605.11436#bib.bib63)), diversity prevents uncertainty collapse (Leng et al., [2024](https://arxiv.org/html/2605.11436#bib.bib64)), discounted success aligns belief quality with task outcomes, and format ensures structural consistency. Ablations in Section [4](https://arxiv.org/html/2605.11436#S4) confirm the importance of each reward.

We train and evaluate Agent-BRACE on various long-horizon, partially observable embodied language tasks. Specifically, Agent-BRACE is trained on Quest, a task from the TextWorld (Côté et al., [2018](https://arxiv.org/html/2605.11436#bib.bib23)) environment, using Qwen2.5-3B-Instruct (Qwen Team, [2024](https://arxiv.org/html/2605.11436#bib.bib22)) and Qwen3-4B-Instruct (Qwen Team, [2025](https://arxiv.org/html/2605.11436#bib.bib21)) as base models, and evaluated on three TextWorld environments: Quest, Treasure, and Cooking. Agent-BRACE outperforms all baselines, including ReAct (Yao et al., [2022](https://arxiv.org/html/2605.11436#bib.bib2)), Direct-Action (RL-trained), ReAct (RL-trained), MEM1 (Zhou et al., [2025](https://arxiv.org/html/2605.11436#bib.bib4)), and PABU (Jiang et al., [2026](https://arxiv.org/html/2605.11436#bib.bib6)), achieving average accuracies of 72.8% and 79.3% on Qwen2.5-3B-Instruct and Qwen3-4B-Instruct, respectively – an average absolute improvement of +14.5% over the strongest RL-trained baseline (Direct-Action (RL)) on Qwen2.5-3B-Instruct and +5.3% on Qwen3-4B-Instruct (all improvements reported in this paper are absolute unless otherwise stated). Crucially, Agent-BRACE maintains a near-constant context window while achieving the best performance. Agent-BRACE also demonstrates strong generalization, achieving consistently high performance on the Treasure and Cooking tasks despite being trained only on Quest. Moreover, we show that Agent-BRACE can be extended to other tasks, with a +2.85% improvement over the strongest RL-trained baseline on ALFWorld (Shridhar et al., [2020b](https://arxiv.org/html/2605.11436#bib.bib49)). Our ablations confirm that each component contributes meaningfully: joint training, belief-state rewards, and an expressive WEP label set each play a critical role – removing any one leads to meaningful degradation. Further analysis shows that the belief becomes better calibrated over the course of an episode, with the Brier score (Glenn et al., [1950](https://arxiv.org/html/2605.11436#bib.bib24)) decreasing from 0.40 to 0.28 and the fraction of *confirmed* claims growing from 21% to 52% as evidence accumulates.

![Refer to caption](https://arxiv.org/html/2605.11436v1/x2.png)

Figure 2: Overview of Agent-BRACE. The agent is decomposed into a *belief state model* $f_\phi$ and a *policy model* $\pi_\theta$, jointly optimized via PPO (Dual Training). At each step $t$, $f_\phi$ consumes the goal $G$, previous belief $b_t$, and new observation $o_{t+1}$ to produce an updated belief $b_{t+1}$ with WEP-based certainty labels (Belief State Update). The policy $\pi_\theta$ then selects an action $a_t$ conditioned on $(G, b_{t+1}, o_{t+1})$ rather than the full history $\mathcal{H}_t$ (Action Selection). The belief model is trained with a composite reward $R^{\text{belief}}$, while the policy model is trained with a binary environment reward $R^{\text{env}}$.
## 2 Methodology: Agent-BRACE

In this section, we introduce our method in detail ([Fig. 2](https://arxiv.org/html/2605.11436#S1.F2)). Agent-BRACE addresses two core challenges in long-horizon agentic tasks: the linear growth of past history and the absence of a belief representation under partial observability. To tackle these, Agent-BRACE jointly trains a *belief state model* and a *policy model* via PPO, where the belief state model maintains a structured, uncertainty-aware belief that serves as a sufficient approximation of the history for downstream action selection.

### 2.1 Environment and Agentic Task

We focus on partially observable environments modeled as POMDPs defined by the tuple $\mathcal{M}=(S,T,A,\Omega,O,R,\gamma)$, where $S$ is the set of latent environment states, $T: S\times A\to\Delta(S)$ is the state-transition distribution, $A$ is the natural language action space, $\Omega$ is the observation space, $O: S\times A\to\Delta(\Omega)$ is the observation distribution, $R: S\times A\to\mathbb{R}$ is the reward function, and $\gamma\in(0,1)$ is the discount factor. Since the current observation $o_t$ is not a sufficient statistic for the environment state $s_t$ (Kaelbling et al., [1998](https://arxiv.org/html/2605.11436#bib.bib9)), an optimal policy $\pi$ must condition on the full history $\mathcal{H}_t=\{G, o_0, a_0, \dots, o_t\}$, or an equivalent belief state $b_t = P(s_t \mid \mathcal{H}_t)$, to maximize expected cumulative reward (Åström, [1965](https://arxiv.org/html/2605.11436#bib.bib10)). We consider agentic tasks where an LLM pursues a goal $G$ by interacting with the environment $\epsilon$ until the objective is achieved or a step budget is reached.
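For intuition, when the state space is finite and $T$ and $O$ are known, the exact belief obeys the standard Bayesian filter recursion (a textbook POMDP identity, stated here for reference): after taking action $a_t$ and receiving observation $o_{t+1}$,

$$b_{t+1}(s') \;\propto\; O(o_{t+1}\mid s', a_t)\sum_{s\in S} T(s'\mid s, a_t)\, b_t(s).$$

This is the update that the learned belief state model introduced next must approximate in text, where neither $T$ nor $O$ is available and the state space is open-ended.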

### 2.2 Decoupled Architecture: Belief State Model and Policy Model

As shown in [Fig. 2](https://arxiv.org/html/2605.11436#S1.F2) (Dual Training), our approach parameterizes an agent with two jointly optimized components: a *belief state model* for state estimation and a *policy model* for action selection:

**Belief State Model ($f_\phi$):** This model is a learnable belief-update function that constructs and maintains an approximate belief representation from raw environment observations. Each belief state is represented as a set of statements, where each statement is annotated with a WEP-based uncertainty label. As shown in [Fig. 2](https://arxiv.org/html/2605.11436#S1.F2) (Belief State Update), $f_\phi$ consumes the goal $G$, the current belief state $b_t$, and the new observation $o_{t+1}$ to produce an updated belief state $b_{t+1} = f_\phi(G, b_t, o_{t+1})$.

**Policy Model ($\pi_\theta$):** This targets the long-horizon history challenge by conditioning action selection on the compact belief $b_{t+1}$ generated by the belief state model rather than on the full history $\mathcal{H}_t$. The next action becomes $\pi_\theta(G, b_{t+1}, o_{t+1})$.
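To fix the data flow, here is a minimal sketch of the decoupled interaction loop in Python; the environment interface and the `update`/`act` method names are illustrative assumptions, not the authors' code:

```python
# A minimal sketch of the decoupled belief/policy loop described above.
# The environment API and the `update`/`act` method names are illustrative
# assumptions, not the authors' implementation.

def run_episode(env, belief_model, policy_model, goal, max_turns=100):
    """Roll out one episode; note the per-step context is bounded."""
    belief = []          # b_0: empty set of WEP-annotated claims
    obs = env.reset()    # o_0: initial textual observation
    for _ in range(max_turns):
        # Belief update: b_{t+1} = f_phi(G, b_t, o_{t+1})
        belief = belief_model.update(goal=goal, belief=belief, observation=obs)
        # Action selection conditions on (G, b_{t+1}, o_{t+1}), not on H_t
        action = policy_model.act(goal=goal, belief=belief, observation=obs)
        obs, reward, done = env.step(action)
        if done:
            return reward    # binary: +1 for success, 0 for failure
    return 0.0
```

Note that the prompt assembled for either model at step $t$ does not grow with $t$, which is what yields the near-constant context window reported in Section 3.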

### 2.3 Belief State Representation

As shown in [Fig. 1](https://arxiv.org/html/2605.11436#S1.F1) (Summary), representing belief states as natural language summaries collapses the distribution over environment states $s_t$ into a single point estimate, discarding the *uncertainty* that belief-state policies rely on in partially observable settings. Such summaries are unstructured, producing free-flowing prose with no separation between distinct facts, making it difficult for the policy model to locate and extract task-relevant signals. This problem compounds over time as summaries grow to accommodate new observations (Kang et al., [2025](https://arxiv.org/html/2605.11436#bib.bib18)), increasing both the length and the density of interleaved facts the policy must parse. Instead, we represent the belief state $b_t$ as *a set of verbalized belief statements*: atomic natural language claims about individual aspects of the environment, each annotated with an explicit uncertainty label (example shown in [Fig. 2](https://arxiv.org/html/2605.11436#S1.F2)) drawn from the Words of Estimative Probability (WEP) scale (Kent, [1964](https://arxiv.org/html/2605.11436#bib.bib66)) – an ordered Likert-style vocabulary grounded in how humans naturally express uncertainty: confirmed ≻ almost certain ≻ probable ≻ possible ≻ unlikely ≻ doubtful ≻ unknown. This yields a structured approximation of the belief distribution $b_t(s)$ that is both interpretable and reliably produced by LLMs (Lin et al., [2022](https://arxiv.org/html/2605.11436#bib.bib44); Tian et al., [2023](https://arxiv.org/html/2605.11436#bib.bib43); Xiong et al., [2023](https://arxiv.org/html/2605.11436#bib.bib52)).
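Concretely, a belief state in this format might look like the following sketch; the claims are hypothetical examples in the spirit of Fig. 2, not drawn from the paper:

```python
# Hypothetical belief state in the paper's format: atomic claims, each
# tagged with a WEP label from the ordinal scale. Claim texts are
# illustrative examples, not taken from the paper.

WEP_SCALE = ["unknown", "doubtful", "unlikely", "possible",
             "probable", "almost certain", "confirmed"]  # low -> high

belief_t = [
    {"claim": "The agent is in the kitchen.",         "certainty": "confirmed"},
    {"claim": "The brass key is inside the cabinet.", "certainty": "probable"},
    {"claim": "The cellar door is locked.",           "certainty": "possible"},
    {"claim": "The location of the lantern.",         "certainty": "unknown"},
]
```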

To define the belief space, we specify a set of structured slots that every belief state must populate (e.g., agent location, inventory, etc.); further details are in Appendix [A](https://arxiv.org/html/2605.11436#A1). While specific values and their instantiations are learned from task experience rather than hand-specified, automatically discovering belief dimensions in fully open-ended environments remains future work. We ablate this structured belief space in Appendix [C](https://arxiv.org/html/2605.11436#A3), finding only a minor drop in performance without it, suggesting that joint training provides a strong foundation for belief learning, with structured belief states delivering additional gains.

### 2.4 Joint Training Procedure

As shown in [Fig. 2](https://arxiv.org/html/2605.11436#S1.F2), we train the belief state model $f_\phi$ and the policy model $\pi_\theta$ jointly via Proximal Policy Optimization (PPO) (Schulman et al., [2017](https://arxiv.org/html/2605.11436#bib.bib19)). Prior to PPO, we perform supervised finetuning (SFT) on the belief state model using GPT 5.4 mini (OpenAI, [2026](https://arxiv.org/html/2605.11436#bib.bib60)) trajectories to enforce a structured belief state representation (SFT details in Appendix [B](https://arxiv.org/html/2605.11436#A2); SFT-stage ablation in Appendix [C](https://arxiv.org/html/2605.11436#A3)). The policy model $\pi_\theta$ is trained to maximize the binary environment reward $R^{\text{env}}$. Simultaneously, the belief state model is optimized to maximize the composite reward $R^{\text{belief}}$. The clipped objective is given by:

$$\mathcal{L}_{\text{policy/belief}}(\theta) = -\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\hat{A}_t\right)\right] \quad (1)$$

where $r_t(\theta)$ is the probability ratio. For the policy model, $\hat{A}_t$ is estimated via Generalized Advantage Estimation (GAE) (Schulman et al., [2015](https://arxiv.org/html/2605.11436#bib.bib20)) using a learned critic $V(s_t)$. To reduce computational cost, we approximate the belief model's advantage as $\hat{A}_t^{\text{belief}} = \frac{R_t^{\text{belief}} - \mu_r}{\sigma_r}$, where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of rewards computed over rollouts from the same task. Both modules are trained via the same objective but differ in advantage estimation: the policy model uses GAE with a learned critic $V(s_t)$, while the belief state model uses GRPO-style group-normalized returns (Shao et al., [2024](https://arxiv.org/html/2605.11436#bib.bib29)).
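As a concrete sketch of the belief model's group-normalized advantage (the small epsilon guarding against zero variance is our addition):

```python
import numpy as np

def belief_advantage(group_rewards: np.ndarray) -> np.ndarray:
    """GRPO-style group-normalized advantage for the belief model.

    `group_rewards` holds R^belief for rollouts of the same task; the
    advantage is the reward standardized over that group. The epsilon
    guard against zero variance is our addition for numerical stability.
    """
    mu, sigma = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mu) / (sigma + 1e-8)
```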

### 2.5 Reward Design

Agent-BRACE is trained with two distinct reward signals: $R^{\text{belief}}$ for the belief state model and $R^{\text{env}}$ for the policy model, optimized jointly so that the belief representations are shaped by the policy's decision-making needs. Both rewards are summarized in [Table 1](https://arxiv.org/html/2605.11436#S2.T1). The final belief state model reward is $R_t^{\text{belief}} = r_t^{\text{format}} \times \frac{1}{4}\left(r_t^{\text{st}} + r_t^{\text{sc}} + r_t^{\text{div}} + r_t^{\text{success}}\right)$; a minimal computational sketch follows Table 1. Full details are in Appendix [D](https://arxiv.org/html/2605.11436#A4).

Table 1: Reward components for the belief state model and policy model.

| Reward | Symbol | Purpose |
|---|---|---|
| **Belief State Model ($R^{\text{belief}}$)** | | |
| State Tracking | $r_t^{\text{st}}$ | Measures the logical consistency of the belief update $b_{t-1}\to b_t$ given observation $o_t$. |
| State Correctness | $r_t^{\text{sc}}$ | Ensures that each belief claim, along with its uncertainty score, remains grounded in the environment state $s$. |
| Diversity | $r_t^{\text{div}}$ | Encourages use of the full WEP vocabulary via the entropy $H(b_t)$ of the label histogram. |
| Format | $r_t^{\text{format}}$ | Enforces structured output; acts as a multiplicative gate zeroing all other rewards for invalid outputs. |
| Task Success | $r_t^{\text{success}}$ | Propagates the task outcome to the belief module via $\gamma^t\times\mathbf{1}[\text{success}]$. |
| **Policy Model ($R^{\text{env}}$)** | | |
| Task Success | $R^{\text{env}}$ | Binary reward from the environment ($+1$ for success, $0$ for failure); the primary reinforcement signal for action selection. |
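A minimal sketch, under assumptions, of how the composite reward and the entropy-based diversity term in Table 1 could be computed; normalizing the entropy to [0, 1] by the log of the vocabulary size is our assumption, and Appendix D gives the authors' exact definitions:

```python
import math
from collections import Counter

def belief_reward(r_st, r_sc, r_div, r_success, format_ok):
    """R^belief_t = r^format_t * (r^st + r^sc + r^div + r^success) / 4.
    The format term is a multiplicative gate: an invalid output zeroes
    every other reward component, as described in Table 1."""
    r_format = 1.0 if format_ok else 0.0
    return r_format * (r_st + r_sc + r_div + r_success) / 4.0

def diversity_reward(wep_labels, vocab_size=7):
    """Entropy H(b_t) of the WEP label histogram. Normalizing by
    log(vocab_size) to land in [0, 1] is an assumption; the paper's
    exact definition is in its Appendix D."""
    if not wep_labels:
        return 0.0
    counts = Counter(wep_labels)
    total = len(wep_labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(vocab_size)
```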

## 3 Experimental Setup and Results

### 3.1 Setup

**Models.** To evaluate Agent-BRACE, we adopt the instruction-tuned models Qwen2.5-3B-Instruct (Qwen Team, [2024](https://arxiv.org/html/2605.11436#bib.bib22)) and Qwen3-4B-Instruct (Qwen Team, [2025](https://arxiv.org/html/2605.11436#bib.bib21)) as our base models. Both the belief state and policy models are initialized from the same base model. Additionally, we use Qwen3-30B-A3B-Instruct (Qwen Team, [2025](https://arxiv.org/html/2605.11436#bib.bib21)) as the judge to evaluate the state tracking and correctness rewards.

**Datasets.** We train and evaluate on TextWorld (Côté et al., [2018](https://arxiv.org/html/2605.11436#bib.bib23)), which provides the flexibility to generate multiple types of text-based games. We construct three different tasks using the TextWorld environment: (i) *Quest*: a text-adventure environment where agents navigate rooms, manipulate objects, and solve quests via natural language; (ii) *Cooking*: a task set in a typical house that consists of finding the right food item and cooking it; (iii) *Treasure*: the agent spawns in a randomly generated maze and must find a specific object mentioned in the objective displayed when the game starts. Agent-BRACE and the other trained baselines are trained only on Quest and evaluated on all three datasets. Full details are provided in Appendix [E](https://arxiv.org/html/2605.11436#A5).

**Baselines.** We compare with several strong baselines: (1) Base Model: off-the-shelf instruction-tuned model; (2) ReAct (Yao et al., [2022](https://arxiv.org/html/2605.11436#bib.bib2)): interleaved reasoning and action selection; (3) Direct-Action (RL): PPO-trained model that directly outputs actions, using the same final environment reward as Agent-BRACE; (4) ReAct (RL): PPO-trained model that additionally outputs its thinking inside `<think>…</think>` tokens before taking an action; (5) MEM1 (Zhou et al., [2025](https://arxiv.org/html/2605.11436#bib.bib4)): RL framework that maintains a compact shared state for memory consolidation and reasoning, integrating prior memory with new observations while strategically discarding irrelevant or redundant information; (6) PABU (Jiang et al., [2026](https://arxiv.org/html/2605.11436#bib.bib6)): belief-state framework that compactly represents an agent's state by explicitly modeling task progress and selectively retaining past actions and observations.

**Implementation Details.** The maximum number of turns during training is set to 15; during inference, to test the long-horizon capability of the method, we set the maximum number of turns to 100. Additional hyperparameter details are available in Appendix [F](https://arxiv.org/html/2605.11436#A6).

### 3.2 Main Results

Table 2: Performance comparison of Agent-BRACE against baselines across three TextWorld environments (Quest, Treasure, Cooking) on Qwen2.5-3B-Instruct and Qwen3-4B-Instruct. Acc. denotes task success rate (%, higher is better); Steps denotes the average number of steps taken (lower is better). Methods marked † are inference-only (no training); Agent-BRACE and the other trained baselines are trained on Quest only, so Treasure and Cooking are out-of-domain (OOD) tasks.

| Method | Quest Acc.↑ | Quest Steps↓ | Treasure Acc.↑ | Treasure Steps↓ | Cooking Acc.↑ | Cooking Steps↓ | Avg. Acc.↑ | Avg. Steps↓ |
|---|---|---|---|---|---|---|---|---|
| **Qwen2.5-3B-Instruct** | | | | | | | | |
| Base Model† | 4.0 | 96.1 | 7.5 | 93.2 | 2.5 | 98.1 | 4.7 | 95.8 |
| ReAct† | 23.0 | 37.6 | 37.0 | 33.6 | 27.5 | 38.4 | 29.2 | 36.5 |
| Direct-Action (RL) | 56.0 | 35.8 | 67.5 | 32.6 | 51.5 | 46.1 | 58.3 | 38.2 |
| ReAct (RL) | 46.5 | 34.2 | 55.0 | 32.7 | 34.5 | 44.4 | 45.3 | 37.1 |
| MEM1 | 29.5 | 62.9 | 30.0 | 47.7 | 52.5 | 48.0 | 37.3 | 52.9 |
| PABU | 73.0 | 37.0 | 72.5 | 34.4 | 33.0 | 73.1 | 59.5 | 48.2 |
| Agent-BRACE | 78.5 | 37.3 | 81.5 | 32.1 | 58.5 | 60.3 | 72.8 | 43.3 |
| **Qwen3-4B-Instruct** | | | | | | | | |
| Base Model† | 61.5 | 32.3 | 65.0 | 30.3 | 69.5 | 34.1 | 65.3 | 32.2 |
| ReAct† | 60.5 | 12.6 | 69.5 | 10.3 | 13.5 | 24.4 | 47.8 | 15.8 |
| Direct-Action (RL) | 74.0 | 29.6 | 72.5 | 28.0 | 75.5 | 31.9 | 74.0 | 29.8 |
| ReAct (RL) | 75.5 | 18.2 | 74.0 | 16.5 | 13.0 | 40.6 | 54.2 | 25.0 |
| MEM1 | 61.5 | 50.2 | 63.5 | 31.4 | 10.0 | 10.0 | 45.0 | 30.5 |
| PABU | 82.2 | 29.1 | 73.5 | 37.2 | 32.5 | 75.6 | 62.7 | 47.3 |
| Agent-BRACE | 88.0 | 30.5 | 81.0 | 30.0 | 69.0 | 44.6 | 79.3 | 35.0 |
**Agent-BRACE outperforms other baselines.** [Table 2](https://arxiv.org/html/2605.11436#S3.SS2) presents the main results across the three TextWorld environments for both Qwen2.5-3B-Instruct and Qwen3-4B-Instruct. Overall, Agent-BRACE achieves the highest average accuracy across all baselines, with 72.8% on Qwen2.5-3B-Instruct and 79.3% on Qwen3-4B-Instruct – an absolute improvement of +14.5% and +5.3%, respectively, over the strongest RL-trained baseline, Direct-Action (RL). On Qwen2.5-3B-Instruct, Agent-BRACE outperforms ReAct (RL) by +27.5%, demonstrating that interleaved chain-of-thought reasoning alone is insufficient under partial observability. Against MEM1, Agent-BRACE improves by +35.5% on Qwen2.5-3B-Instruct and +34.3% on Qwen3-4B-Instruct, confirming that summary-based compression discards task-critical signals. Agent-BRACE on average also outperforms PABU by +13.3% on Qwen2.5-3B-Instruct and +16.6% on Qwen3-4B-Instruct. The improvements are consistent across both models and suggest a clear pattern: baselines that treat history as a sufficient statistic – whether through raw context (Direct-Action, ReAct), summarization (MEM1), or progress-aware compression (PABU) – cannot maintain an uncertainty approximation over world state. Agent-BRACE's improvement stems from explicitly representing an approximate belief state via WEP annotations and jointly optimizing the belief state model with the policy, so that the agent learns to act under uncertainty rather than from a single point estimate of the world.

![Refer to caption](https://arxiv.org/html/2605.11436v1/x3.png)

Figure 3: Agent-BRACE maintains a near-constant context window while achieving the highest solve rate (78.5%). Comparison of context-length growth (left) and cumulative solve rate (right) across methods with a maximum of 100 game steps on Quest using Qwen2.5-3B-Instruct.
**Agent-BRACE generalizes to held-out TextWorld environments.** Agent-BRACE is trained exclusively on Quest, yet achieves strong and consistent performance across all three tasks ([Table 2](https://arxiv.org/html/2605.11436#S3.SS2)). It transfers most effectively to Treasure (81.5% on Qwen2.5-3B-Instruct and 81.0% on Qwen3-4B-Instruct), which shares Quest's navigation structure. On Cooking, which requires sequential sub-goal completion and is structured differently from Quest, Agent-BRACE still outperforms most baselines (+7.0% over Direct-Action and +6.0% over MEM1 on Qwen2.5-3B-Instruct). MEM1, ReAct (RL), and PABU all collapse on Cooking under Qwen3-4B-Instruct (10.0%, 13.0%, and 32.5%), confirming that history-based and progress-aware representations are brittle under the non-monotonic sub-goal structure of Cooking. Agent-BRACE, maintaining per-claim uncertainty without assuming a linear progress signal, remains robust across both settings without per-task engineering.

**Agent-BRACE maintains a bounded context window and an enhanced solve rate.** [Fig. 3](https://arxiv.org/html/2605.11436#S3.F3) plots context length (left) and cumulative solve rate (right) across methods over 100 steps. ReAct and Direct-Action (RL) grow linearly, eventually exceeding the context limit, while Agent-BRACE maintains a near-constant context window. Crucially, despite operating within a similar context budget as MEM1 and Summary-Belief (an ablation run where the belief state model is replaced with an LLM summarizer), Agent-BRACE achieves a substantially higher cumulative solve rate of 78.5% vs. 28.5% and 60.4%, respectively. The gap isolates the contribution of approximating the belief distribution via WEP labels from context bounding alone.

Table 3: Performance comparison of Agent-BRACE (Qwen3-4B-Instruct) against baselines on the ALFWorld environment. Blue rows are inference-only.

**Agent-BRACE performance on ALFWorld.** While Quest, Cooking, and Treasure share a common navigation and inventory structure, we also test Agent-BRACE's generalization to tasks with qualitatively different actions. ALFWorld (Shridhar et al., [2020b](https://arxiv.org/html/2605.11436#bib.bib49)) is built on top of the ALFRED (Shridhar et al., [2020a](https://arxiv.org/html/2605.11436#bib.bib65)) household dataset and requires the agent to execute multi-step object manipulation (e.g., pick, clean, heat, cool, place) with observation and action structures different from the TextWorld environments. [Table 3](https://arxiv.org/html/2605.11436#S3.SS2) shows the performance of Agent-BRACE (Qwen3-4B-Instruct) on ALFWorld, averaged over three evaluation runs. Agent-BRACE achieves the highest accuracy of 30.71%, outperforming the strongest RL-trained baseline, Direct-Action (RL), by +2.85% and the Base Model by +6.42%. It also outperforms the memory-based baseline MEM1 by +5%. These results suggest that Agent-BRACE's structured belief representation generalizes to other tasks.

## 4 Ablation and Analysis

To understand the importance of each component of Agent-BRACE, we conduct an ablation study, specifically: (1) Agent-BRACE (Limited WEP): training the belief state model to capture only two levels of uncertainty, *confirmed* and *unknown*; (2) Agent-BRACE (Summary-Belief): instead of training the belief state to capture both an abstraction of past history and uncertainty, it only summarizes the past history $\mathcal{H}_k$; (3) Agent-BRACE (Frozen Belief Model): the belief state model remains static and only the policy model is trained; (4) Agent-BRACE (No State Reward): the belief model is trained only on the discounted success reward, with the other state-relevant rewards removed.

Table 4: Ablation analysis of Agent-BRACE using Qwen2.5-3B-Instruct and Qwen3-4B-Instruct. Each row is a variation of Agent-BRACE.

| Method | Quest Acc.↑ | Quest Steps↓ | Treasure Acc.↑ | Treasure Steps↓ | Cooking Acc.↑ | Cooking Steps↓ | Avg. Acc.↑ | Avg. Steps↓ |
|---|---|---|---|---|---|---|---|---|
| **Qwen2.5-3B-Instruct** | | | | | | | | |
| Agent-BRACE | 78.5 | 37.3 | 81.5 | 32.1 | 58.5 | 60.3 | 72.8 | 43.3 |
| – Limited WEP | 76.0 | 42.4 | 71.0 | 40.1 | 58.0 | 66.1 | 68.3 | 49.5 |
| – No State Reward | 78.0 | 39.4 | 73.0 | 40.2 | 63.0 | 53.7 | 71.3 | 44.5 |
| – Frozen belief model | 66.5 | 43.5 | 59.5 | 46.2 | 28.0 | 80.3 | 51.3 | 56.7 |
| – Summary-Belief | 60.4 | 24.0 | 54.3 | 34.1 | 3.5 | 18.3 | 39.4 | 25.5 |
| **Qwen3-4B-Instruct** | | | | | | | | |
| Agent-BRACE | 88.0 | 30.5 | 81.0 | 30.0 | 69.0 | 44.6 | 79.3 | 35.1 |
| – Limited WEP | 79.0 | 35.6 | 69.0 | 40.7 | 48.0 | 73.9 | 65.3 | 50.1 |
| – No State Reward | 59.5 | 44.4 | 64.5 | 45.1 | 58.5 | 57.6 | 60.8 | 45.0 |
| – Frozen belief model | 77.5 | 34.7 | 57.0 | 50.1 | 35.5 | 74.1 | 56.7 | 53.3 |
| – Summary-Belief | 61.5 | 47.1 | 36.0 | 66.4 | 38.5 | 67.4 | 45.3 | 60.3 |
![Refer to caption](https://arxiv.org/html/2605.11436v1/x4.png)

Figure 4: The Brier score drops from 0.40 → 0.28 while *confirmed* claims grow from 21% to 52%, confirming progressive calibration as evidence accumulates. WEP label distribution (bars, left axis) and mean Brier score (line, right axis) across agent steps for Qwen3-4B-Instruct (Agent-BRACE) on the Quest dataset.
**Belief uncertainty becomes better calibrated over the course of an episode.** [Fig. 4](https://arxiv.org/html/2605.11436#S4.F4) analyzes the uncertainty representations learned by the belief model. To compute Brier scores, we map each WEP label to its nominal probability following the ordinal scale (e.g., confirmed ≈ 1.0, probable ≈ 0.75, possible ≈ 0.50, doubtful ≈ 0.25, unknown ≈ 0.0) and compute the Brier score (Glenn et al., [1950](https://arxiv.org/html/2605.11436#bib.bib24)) against binary ground truth from the LLM judge (Qwen3-30B-A3B-Instruct-2507); 0 indicates perfect calibration and 0.25 corresponds to the Brier score of random chance. The Brier score decreases steadily from 0.40 at step 0 to below 0.28 by step 14, a reduction of approximately 0.12 points, approaching the 0.25 chance level from above, while the fraction of *confirmed* claims grows from 21% to 52% – confirming that the belief model progressively sharpens its uncertainty estimates as evidence accumulates. Additional calibration analysis (Appendix [G](https://arxiv.org/html/2605.11436#A7)) reveals that the belief model begins underconfident but becomes increasingly calibrated over the course of training, with high-confidence labels (*almost certain*, *confirmed*) remaining well calibrated throughout – a safer failure mode than overconfidence (Stengel-Eskin et al., [2024](https://arxiv.org/html/2605.11436#bib.bib45)). Appendix [K](https://arxiv.org/html/2605.11436#A11) provides qualitative examples of uncertainty labels being correctly assigned and progressively resolved as evidence accumulates.
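As a sketch, the calibration measurement described above can be reproduced as follows; the probabilities for *almost certain* and *unlikely* are interpolated assumptions, since the text lists only the other anchors:

```python
# Map WEP labels to nominal probabilities and score against the judge's
# binary verdicts. The values for "almost certain" and "unlikely" are
# interpolated assumptions; the paper states the remaining anchors.

WEP_PROB = {
    "confirmed": 1.0, "almost certain": 0.875, "probable": 0.75,
    "possible": 0.5, "unlikely": 0.375, "doubtful": 0.25, "unknown": 0.0,
}

def brier_score(claims):
    """Mean squared error between verbalized confidence and ground truth.

    `claims` is a list of (wep_label, is_true) pairs, where is_true is
    the binary judgment from the LLM judge; 0 is perfect calibration.
    """
    return sum((WEP_PROB[label] - float(truth)) ** 2
               for label, truth in claims) / len(claims)
```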

**Joint training of the belief model and policy drives performance gains.** [Table 4](https://arxiv.org/html/2605.11436#S4) presents the results of various ablations of Agent-BRACE on Qwen2.5-3B-Instruct and Qwen3-4B-Instruct. When the belief model is frozen and restricted to summarizing past observations (Summary-Belief), average performance drops by 33.4% and 34.0% for Qwen2.5-3B-Instruct and Qwen3-4B-Instruct, respectively. The drop is most severe on Cooking (58.5% → 3.5% on Qwen2.5-3B-Instruct; 69.0% → 38.5% on Qwen3-4B-Instruct), because Cooking requires fine-grained uncertainty tracking over sequential sub-goals, and collapsing the belief distribution is particularly damaging when the agent must reason over multiple possible world states simultaneously. Similarly, freezing the belief model and training only the policy (Frozen belief model) reduces average accuracy by 21.5 absolute points on Qwen2.5-3B-Instruct and 22.6 points on Qwen3-4B-Instruct, demonstrating the importance of jointly optimizing the belief and policy models rather than treating belief as a fixed module.

**Belief quality depends on reward design, label granularity, and task-specific supervision.** As seen in [Table 4](https://arxiv.org/html/2605.11436#S4), removing the state-grounding rewards (state correctness, state tracking, diversity, and format) degrades average accuracy by 1.5% on Qwen2.5-3B-Instruct and 18.5% on Qwen3-4B-Instruct, confirming that explicit grounding signals are critical for downstream performance. Collapsing the 7-level WEP scale to a binary confirmed/unknown vocabulary drops average accuracy from 79.3% → 65.3% on Qwen3-4B-Instruct and from 72.8% → 68.3% on Qwen2.5-3B-Instruct, demonstrating that uncertainty granularity directly impacts policy performance. Additionally, we ablated task-specific belief supervision (Appendix [C](https://arxiv.org/html/2605.11436#A3)), i.e., both the SFT warm-start and domain-structured prompting, which causes a moderate drop from 79.3% → 69.3% on Qwen3-4B-Instruct; yet even without belief supervision, Agent-BRACE still outperforms most trained baselines, confirming that joint RL alone provides a sufficient signal for structured world-state tracking.

## 5 Related Work

Belief state estimation has a long history in sequential decision making under partial observability. Classical POMDP solvers (Kaelbling et al., [1998](https://arxiv.org/html/2605.11436#bib.bib9); Smallwood and Sondik, [1973](https://arxiv.org/html/2605.11436#bib.bib30)) maintain exact belief distributions over finite, discrete state spaces with known transition and observation models. To scale to continuous state spaces, Monte Carlo methods such as particle filters approximate the belief distribution (Silver and Veness, [2010](https://arxiv.org/html/2605.11436#bib.bib33); Thrun, [1999](https://arxiv.org/html/2605.11436#bib.bib34)). When transition and observation models are unknown, deep RL approaches learn belief representations as distributions over latent states via variational inference (Hafner et al., [2020](https://arxiv.org/html/2605.11436#bib.bib31); Gregor et al., [2019](https://arxiv.org/html/2605.11436#bib.bib32)). However, these approaches assume a pre-specified, continuous state space and learn implicit latent representations – assumptions that break down in text-based environments, where the state space is open-ended and interpretability is desirable.

Increasingly, LLMs form the backbone of interactive agents for long-horizon, partially observable tasks such as software engineering and web navigation. ReAct (Yao et al., [2022](https://arxiv.org/html/2605.11436#bib.bib2)) interleaves chain-of-thought reasoning with actions, conditioning on a growing interaction history. Several memory-based approaches (Yu et al., [2025](https://arxiv.org/html/2605.11436#bib.bib15); Kang et al., [2025](https://arxiv.org/html/2605.11436#bib.bib18); Zhang et al., [2025](https://arxiv.org/html/2605.11436#bib.bib36); Xu et al., [2025](https://arxiv.org/html/2605.11436#bib.bib3)) address the resulting context burden: MEM1 (Zhou et al., [2025](https://arxiv.org/html/2605.11436#bib.bib4)) selectively compresses past interactions through summarization. Although these methods reduce the burden of long contexts, they still treat history as a sufficient representation of the agent's state, without modeling uncertainty over the environment.

A complementary line of work directly addresses belief representation in LLM agents. One approach uses internal latent states as an implicit proxy for belief (Kamel et al., [2025](https://arxiv.org/html/2605.11436#bib.bib61)), sacrificing interpretability. At the other extreme, StateAct (Rozanov and Rei, [2025](https://arxiv.org/html/2605.11436#bib.bib5)), ABBEL (Lidayan et al., [2025](https://arxiv.org/html/2605.11436#bib.bib7)), and PABU (Jiang et al., [2026](https://arxiv.org/html/2605.11436#bib.bib6)) represent agent state as structured natural language summaries, but collapse the belief distribution to a single maximum-likelihood estimate, discarding the uncertainty inherent to belief states. Agent-BRACE addresses this by externalizing belief as structured text while preserving uncertainty via per-claim WEP annotations, jointly optimizing the belief state model and the policy model via reinforcement learning.

## 6 Conclusion

We introduce Agent-BRACE, a training method that jointly trains a belief state model and a policy model via reinforcement learning. The belief state model produces a structured approximation of the belief distribution as atomic natural language claims, each annotated with a Words of Estimative Probability label on an ordered scale. Conditioning the policy on this compact belief rather than on the raw interaction history simultaneously addresses two challenges that arise when LLM agents operate in POMDP-style environments: representing uncertainty over an open-ended state space, and bounding the context required for action selection. Agent-BRACE attains the highest average performance on three TextWorld environments on both Qwen2.5-3B-Instruct and Qwen3-4B-Instruct. Despite training only on Quest, the method transfers to the held-out Treasure and Cooking tasks, suggesting that the structured belief representation captures task-relevant attributes without per-task engineering. Ablations confirm that joint optimization, the graded WEP scale, and the state tracking/correctness rewards each contribute meaningfully. Further analysis shows that the learned belief is well calibrated and sharpens as evidence accumulates, with the Brier score dropping as steps progress.

## Acknowledgments

This work was supported by the Microsoft Agentic AI Research and Innovation (AARI) grant program, an NDSEG PhD Fellowship, NSF-AI Engage Institute DRL-2112635, NSF CAREER Award 1846185, and an Apple PhD Fellowship. The views contained in this article are those of the authors and not of the funding agencies.

## References

- K. J. Åström (1965) Optimal control of Markov processes with incomplete state information I. Journal of Mathematical Analysis and Applications 10, pp. 174–205.
- A. Chung, Y. Zhang, K. Lin, A. Rawal, Q. Gao, and J. Chai (2025) Evaluating long-context reasoning in LLM-based web agents. arXiv preprint arXiv:2512.04307.
- M. Côté, A. Kádár, X. Yuan, B. Kybartas, T. Barnes, E. Fine, J. Moore, M. Hausknecht, L. El Asri, M. Adada, et al. (2018) TextWorld: a learning environment for text-based games. In Workshop on Computer Games, pp. 41–75.
- X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su (2023) Mind2Web: towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, pp. 28091–28114.
- W. B. Glenn et al. (1950) Verification of forecasts expressed in terms of probability. Monthly Weather Review 78(1), pp. 1–3.
- K. Gregor, D. Jimenez Rezende, F. Besse, Y. Wu, H. Merzic, and A. van den Oord (2019) Shaping belief states with generative environment models for RL. Advances in Neural Information Processing Systems 32.
- D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba (2020) Mastering Atari with discrete world models. arXiv preprint arXiv:2010.02193.
- H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu (2024) WebVoyager: building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6864–6890.
- H. Jiang, L. Ge, H. Cai, and R. Song (2026) PABU: progress-aware belief update for efficient LLM agents. arXiv preprint arXiv:2602.09138.
- C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2024) SWE-bench: can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
- L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1–2), pp. 99–134.
- A. Kamel, T. Rastogi, M. Ma, K. Ranganathan, and K. Zhu (2025) Emergent world beliefs: exploring transformers in stochastic games. arXiv preprint arXiv:2512.23722.
- M. Kang, W. Chen, D. Han, H. A. Inan, L. Wutschitz, Y. Chen, R. Sim, and S. Rajmohan (2025) ACON: optimizing context compression for long-horizon LLM agents. arXiv preprint arXiv:2510.00615.
- S. Kent (1964) Words of estimative probability (2nd edition). Studies in Intelligence 8(4), pp. 49–65.
- J. Leng, C. Huang, B. Zhu, and J. Huang (2024) Taming overconfidence in LLMs: reward calibration in RLHF. arXiv preprint arXiv:2410.09724.
- A. Lidayan, J. Bjorner, S. Golechha, K. Goyal, and A. Suhr (2025) ABBEL: LLM agents acting through belief bottlenecks expressed in language. arXiv preprint arXiv:2512.20111.
- S. Lin, J. Hilton, and O. Evans (2022) Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173.
- C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha (2024) The AI Scientist: towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292.
- A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabian, et al. (2025) AlphaEvolve: a coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.
- OpenAI (2026) Introducing GPT-5.4. [Link](https://openai.com/index/introducing-gpt-5-4/).
- Qwen Team (2024) Qwen2.5: a party of foundation models. [Link](https://qwenlm.github.io/blog/qwen2.5/).
- Qwen Team (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- N. Rozanov and M. Rei (2025) StateAct: enhancing LLM base agents via self-prompting and state-tracking. In Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pp. 367–385.
- J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox (2020a) ALFRED: a benchmark for interpreting grounded instructions for everyday tasks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht (2020b) ALFWorld: aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.
- D. Sileo and M. F. Moens (2023) Probing neural language models for understanding of words of estimative probability. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pp. 469–476.
- D. Silver and J. Veness (2010) Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems 23.
- R. D. Smallwood and E. J. Sondik (1973) The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21(5), pp. 1071–1088.
- E. Stengel-Eskin, P. Hase, and M. Bansal (2024) LACIE: listener-aware finetuning for calibration in large language models. Advances in Neural Information Processing Systems 37, pp. 43080–43106.
- Z. Tang, K. Shen, and M. Kejriwal (2026) An evaluation of estimative uncertainty in large language models. npj Complexity 3(1), pp. 8.
- S. Thrun (1999) Monte Carlo POMDPs. Advances in Neural Information Processing Systems 12.
- K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. D. Manning (2023) Just ask for calibration: strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5433–5442.
- B. van Tiel, U. Sauerland, and M. Franke (2022) Meaning and use in the expression of estimative probability. Open Mind 6, pp. 250–263.
- R. Wang and P. Ammanabrolu (2025) A practitioner's guide to multi-turn agentic reinforcement learning. arXiv preprint arXiv:2510.01132.
- M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi (2023) Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.
- W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang (2025) A-MEM: agentic memory for LLM agents. arXiv preprint arXiv:2502.12110.
- J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press (2024) SWE-agent: agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2022) ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
- H. Yu, T. Chen, J. Feng, J. Chen, W. Dai, Q. Yu, Y. Zhang, W. Ma, J. Liu, M. Wang, et al. (2025) MemAgent: reshaping long-context LLM with multi-conv RL-based memory agent. arXiv preprint arXiv:2507.02259.
- Z. Yuan, S. Yuan, and L. Xie (2026) RPMS: enhancing LLM-based embodied planning through rule-augmented memory synergy. arXiv preprint arXiv:2603.17831.
- Y. Zhang, J. Shu, Y. Ma, X. Lin, S. Wu, and J. Sang (2025) Memory as action: autonomous context curation for long-horizon agentic tasks. arXiv preprint arXiv:2510.12635.
- Y. Zhao, J. Zhu, Z. Ding, and X. Li (2026) GRACE: reinforcement learning for grounded response and abstention under contextual evidence. arXiv preprint arXiv:2601.04525.
- S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, et al. (2023) WebArena: a realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.
- Z. Zhou, A. Qu, Z. Wu, S. Kim, A. Prakash, D. Rus, J. Zhao, B. K. H. Low, and P. P. Liang (2025) MEM1: learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841.
- D. Zou, Y. Chen, J. Wang, H. Yang, M. Li, J. Cheng, P. Li, and Y. Gong (2026) Reducing belief deviation in reinforcement learning for active reasoning. arXiv preprint arXiv:2510.12264.

## Appendix A Belief State Structure

In the TextWorld environment, the belief state tracks five critical dimensions: (i) the agent's current location, (ii) topological room connections, (iii) states of observed objects, (iv) inventory contents, and (v) progress relative to specific sub-goals. To ensure a clean separation of concerns between the belief model and the policy, the belief state is strictly prohibited from generating plans, intentions, or hypothetical future actions. This ensures that $b_k$ serves exclusively as an approximation of the current environment state, leaving decision making to the policy model. [Appendix K](https://arxiv.org/html/2605.11436#A11) shows the full prompt used for generating the belief state.
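For illustration, a structured belief of this form can be thought of as a typed container of certainty-annotated claims. The sketch below is hypothetical and only mirrors the claim/certainty structure described above; the actual representation is free-form bullets, as shown in Appendix K.

```python
from dataclasses import dataclass
from typing import List

# Ordinal WEP certainty labels used in the belief format,
# ordered from most to least certain.
CERTAINTY_LEVELS = [
    "confirmed", "almost certain", "probable",
    "possible", "unlikely", "doubtful", "unknown",
]

@dataclass
class BeliefClaim:
    """One atomic natural-language claim with a verbalized certainty label."""
    subject: str      # e.g. "player location"
    predicate: str    # e.g. "in the kitchen"
    certainty: str    # one of CERTAINTY_LEVELS

    def __post_init__(self):
        assert self.certainty in CERTAINTY_LEVELS, self.certainty

@dataclass
class BeliefState:
    """Structured approximate belief b_k: a set of annotated claims."""
    claims: List[BeliefClaim]

# Hypothetical example covering the five tracked dimensions:
b_k = BeliefState(claims=[
    BeliefClaim("player location", "in the pantry", "confirmed"),
    BeliefClaim("south exit", "leads through the plain door", "possible"),
    BeliefClaim("shelf", "empty", "probable"),
    BeliefClaim("inventory", "carrying nothing", "confirmed"),
    BeliefClaim("cookbook location", "not yet observed", "unknown"),
])
```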

## Appendix B Belief Model SFT on Teacher Trajectories

We first perform SFT on the belief model $f_\phi$ using a teacher dataset $\mathcal{D}_{\text{teacher}}$. These trajectories are generated by GPT 5.4 mini using the prompt in [Appendix K](https://arxiv.org/html/2605.11436#A11), which instructs the model to perform state tracking. For each transition $(G, b_{k-1}, o_k) \to b_k$, we optimize the standard cross-entropy loss:

$$\mathcal{L}_{\text{SFT}}(\phi) = -\mathbb{E}_{(g,\, b_{k-1},\, o_k,\, b_k) \sim \mathcal{D}_{\text{teacher}}}\left[\sum_{t}\log f_{\phi}\bigl(b_{k,t} \mid g,\, b_{k-1},\, o_k,\, b_{k,<t}\bigr)\right] \qquad (2)$$

This stage is necessary to teach the belief model to use the Likert certainty scale and adhere to the structured representation.
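For concreteness, the following is a minimal sketch of this objective using Hugging Face Transformers. The prompt layout is an illustrative stand-in for the belief prompt in Appendix K, and Transformers averages (rather than sums) the negative log-likelihood over the target tokens.

```python
# Minimal SFT loss sketch for Eq. (2): cross-entropy on the target belief
# tokens only, with prompt positions masked out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

def sft_loss(goal, prev_belief, obs, target_belief):
    """NLL on b_k tokens given (g, b_{k-1}, o_k); prompt tokens are ignored."""
    prompt = (f"Goal: {goal}\nPrevious belief state: {prev_belief}\n"
              f"Current observation: {obs}\n")
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target_belief, return_tensors="pt",
                     add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask loss on prompt positions
    return model(input_ids=input_ids, labels=labels).loss
```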

To verify that the SFT phase in belief state model training only bootstraps structural formatting rather than transferring task knowledge, [Fig. 6](https://arxiv.org/html/2605.11436#A9.F6) tracks individual reward components across belief state model training ([Appendix I](https://arxiv.org/html/2605.11436#A9)). Format compliance remains consistently high (>0.96) from the first training step, while state correctness, diversity, and task rewards all begin near their lower bounds and rise steadily, confirming that belief quality is learned through RL, not inherited from the SFT teacher.

## Appendix C Belief State without Task-Specific Supervision

To decouple the contribution of task-specific belief supervision, we run an additional ablation: Agent-BRACE without belief supervision. First, the belief state model is initialized directly from the base model (Qwen3-4B-Instruct) rather than from an SFT checkpoint trained on belief state annotations from GPT 5.4 mini, eliminating the supervised warm-start. Second, the belief state prompt template is made fully domain-agnostic: all game-specific entity names, such as room and object identifiers in the in-context examples, are either removed or replaced with abstract placeholders ([Appendix K](https://arxiv.org/html/2605.11436#A11) gives the full prompt). Under this configuration, the belief state model must bootstrap a meaningful world-state representation entirely through PPO training, guided only by the task reward signal and the belief state model rewards.

As shown in [Table 5](https://arxiv.org/html/2605.11436#A3), removing belief supervision leads to consistent degradation across all three tasks: average accuracy drops from 79.3% to 69.3%. Compared against the full Qwen3-4B-Instruct baseline suite from [Section 3.2](https://arxiv.org/html/2605.11436#S3.SS2), the ablation (- Belief Supervision) at 69.3% average accuracy still outperforms PABU (62.7%), ReAct RL (54.2%), MEM1 (45.0%), ReAct (47.8%), and the Base Model (65.3%), falling only slightly behind Direct-Action RL (74.0%) among trained baselines. Critically, the degradation is moderate rather than catastrophic: the system retains substantial task-solving capability even without any belief supervision. This confirms that the SFT warm-start and domain-structured prompting are important catalysts for belief state model learning, but joint RL training alone provides a sufficiently strong signal to teach structured world-state tracking from a general-purpose language prior. This result is encouraging for real-world applicability, where it is not always feasible to define task-relevant states.

Table 5: Ablation of belief state supervision on Qwen3-4B-Instruct. Agent-BRACE vs. a variant with no SFT initialization and no domain-specific belief prompt structure (- Belief Supervision).

## Appendix D Reward Design

To train Agent-BRACE, we define two distinct reward signals, one for the belief state model ($R^{\text{belief}}$) and one for the policy model ($R^{\text{env}}$), which are optimized jointly so that the belief representation is shaped by the policy's decision-making needs.

#### Belief State Model Rewards ($R_t^{\text{belief}}$):

As shown in [Fig. 2](https://arxiv.org/html/2605.11436#S1.F2), the belief state model is trained using a composite reward consisting of five signals. To show the importance of these rewards, we ablate the belief-state-relevant rewards in [Section 4](https://arxiv.org/html/2605.11436#S4).

- State Tracking Reward ($r^{\text{st}}_t$): This reward assesses the logical consistency of the belief update $b_{t-1} \to b_t$ given observation $o_t$. An LLM judge (see [Appendix K](https://arxiv.org/html/2605.11436#A11)) assesses whether new information in $o_t$ is incorporated into the updated belief, counting $N_{\text{new}}$, new facts in $o_t$ correctly added; $N_{\text{missing}}$, new facts absent or wrong in $b_t$; $N_{\text{stale}}$, prior beliefs contradicted by $o_t$ but left unchanged; and $N_{\text{total}}$, total claims in $b_t$. The reward is the product of coverage of new information and freshness of retained beliefs: $r^{\text{st}}_t = \frac{N_{\text{new}}}{N_{\text{new}} + N_{\text{missing}}} \times \bigl(1 - \frac{N_{\text{stale}}}{N_{\text{total}}}\bigr)$.
- State Correctness Reward ($r^{\text{sc}}_t$): This reward ensures that each belief state claim, along with its certainty label, remains grounded in the environment state. An LLM judge parses $b_t$ into $(subject, predicate, certainty)$ tuples (see [Appendix K](https://arxiv.org/html/2605.11436#A11)), then verifies each tuple against $s_t$, classifying it as fully correct, partially correct (the underlying fact is true, but the certainty label is miscalibrated), or incorrect. The final reward is $r^{\text{sc}}_t = \frac{N_{\text{correct}} + 0.5 \times N_{\text{partial}}}{N_{\text{ver}}}$, where $N_{\text{ver}} = N_{\text{correct}} + N_{\text{partial}} + N_{\text{incorrect}}$.
- Format Reward ($r^{\text{format}}_t$): This reward assesses adherence to the structured representation. To mitigate reward hacking and degradation of the structure, we apply it as a multiplicative gating signal, assigning zero reward to structurally invalid outputs.
- Diversity Reward ($r^{\text{div}}_t$): Each certainty phrase in the belief state is matched against an ordered keyword list (e.g., confirmed ≈ 1.0, probable ≈ 0.75, possible ≈ 0.50, doubtful ≈ 0.25, unknown ≈ 0.0) and mapped to one of 7 canonical levels. Shannon entropy $H = -\sum_i p_i \log p_i$ is computed over the resulting label distribution and normalized to $[0, 1]$. The reward is maximized when the claims' certainty labels are spread evenly across the WEP vocabulary and minimized when they collapse onto a single label.
- Discounted Success Reward ($r^{\text{success}}_t$): A time-step-discounted success reward, $\gamma^{t} \times \mathbf{1}[\text{success}]$, is assigned to each belief state based on the agent's final task outcome.

The final reward assigned to a belief state is $R^{\text{belief}}_t = r^{\text{format}}_t \times \frac{1}{4}\bigl(r^{\text{st}}_t + r^{\text{sc}}_t + r^{\text{div}}_t + r^{\text{success}}_t\bigr)$. The format reward acts as a multiplicative gate, zeroing out all other rewards for structurally invalid outputs and preventing reward hacking.
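To make the composition concrete, below is a minimal sketch of the belief reward under the definitions above. The judge counts and certainty labels are assumed to be already extracted from the LLM-judge outputs, and all helper names are illustrative rather than the paper's actual code.

```python
import math
from collections import Counter

LEVELS = ["confirmed", "almost certain", "probable", "possible",
          "unlikely", "doubtful", "unknown"]

def state_tracking(n_new, n_missing, n_stale, n_total):
    """r^st: coverage of new facts times freshness of retained beliefs."""
    if n_total == 0:
        return 0.0
    coverage = n_new / max(1, n_new + n_missing)
    return coverage * (1.0 - n_stale / n_total)

def state_correctness(n_correct, n_partial, n_incorrect):
    """r^sc: partially correct claims count for half credit."""
    n_ver = n_correct + n_partial + n_incorrect
    return 0.0 if n_ver == 0 else (n_correct + 0.5 * n_partial) / n_ver

def diversity(labels):
    """r^div: normalized Shannon entropy of the certainty-label distribution."""
    if not labels:
        return 0.0
    counts, n = Counter(labels), len(labels)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(LEVELS))  # 1.0 when spread over all 7 levels

def belief_reward(fmt_ok, r_st, r_sc, r_div, r_success):
    # Format acts as a multiplicative gate: invalid structure zeroes everything.
    return float(fmt_ok) * 0.25 * (r_st + r_sc + r_div + r_success)
```

The discounted success component would be computed per the definition above, e.g. `r_success = gamma ** t * float(success)`.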

#### Policy Model Reward ($R^{\text{env}}$):

The policy model $\pi_\theta$ is optimized to maximize the policy reward, which is derived from the external environment. This sparse signal (+1 for task success, 0 for task failure) provides the primary reinforcement for action selection, ensuring that the policy learns to select the best next action.

## Appendix E Dataset Details

All tasks are built on the TextWorld environment, a procedural text-game generator that produces fully observable game graphs alongside natural-language descriptions. We construct three benchmark suites with controlled difficulty curricula: Quest, Treasure, and Cooking. All three suites share the same split sizes: 1,000 training games, 100 validation games, and 200 test games, generated with non-overlapping base seeds (10000 / 20000 / 30000, respectively). Training games span easier difficulty levels, while validation and test games sample progressively harder configurations to measure out-of-distribution generalization. We also extend our dataset to the ALFWorld environment.

### E.1 Quest

#### Task.

The agent must navigate a multi-room environment, locate and manipulate objects (keys, containers, doors), and collect a designated target object. Winning requires executing the full quest sequence in the correct order.

#### Generation.

Games are generated with `tw-make custom`, parametrized by a rooms : objects : quest-length triplet that directly controls world complexity. Training games cycle across four configurations ([Table 6](https://arxiv.org/html/2605.11436#A5.T6)). Validation and test games cycle across five harder configurations, ranging from 6 rooms / 6 objects / 8-step quests up to 8 rooms / 12 objects / 13-step quests.

Table 6: Basic (Quest) training configurations.
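As a concrete (hypothetical) illustration of this pipeline, the sketch below drives TextWorld's `tw-make custom` CLI from Python. The configuration triplets are placeholders rather than the exact curriculum in Table 6; only the seed offset follows the training-split seed given above.

```python
# Sketch of Quest training-set generation via the tw-make CLI.
import subprocess

TRAIN_CONFIGS = [(3, 4, 4), (4, 5, 5), (5, 6, 6), (5, 8, 7)]  # hypothetical
BASE_SEED = 10000  # training split seed from this appendix

for i in range(1000):
    rooms, objects, quest_len = TRAIN_CONFIGS[i % len(TRAIN_CONFIGS)]
    subprocess.run([
        "tw-make", "custom",
        "--world-size", str(rooms),
        "--nb-objects", str(objects),
        "--quest-length", str(quest_len),
        "--seed", str(BASE_SEED + i),
        "--output", f"games/quest/train_{i}.z8",
    ], check=True)
```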

### E.2 Treasure

#### Task.

The agent is placed inside a procedurally generated maze and must locate a named treasure object (*e.g.*, a latchkey) hidden in a random room. The objective is given in natural language at the start of each episode.

#### Generation.

Games are generated with `tw-make tw-treasure_hunter --level L`, where the level integer (1–30) jointly governs world size, container nesting depth, and the number of distractor objects. Training games cycle across levels {1, 2, 4, 6, 8} (easy band); validation and test games cycle across levels {14, 16, 18, 20, 22, 25, 28, 30}.
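A compact sketch of the same idea for the Treasure suite, cycling the level bands above; the seed offsets mirror the split seeds in this appendix and the output paths are illustrative.

```python
# Sketch of Treasure game generation across the easy and hard level bands.
import subprocess

def make_treasure(split, n_games, levels, base_seed):
    for i in range(n_games):
        subprocess.run([
            "tw-make", "tw-treasure_hunter",
            "--level", str(levels[i % len(levels)]),
            "--seed", str(base_seed + i),
            "--output", f"games/treasure/{split}_{i}.z8",
        ], check=True)

make_treasure("train", 1000, [1, 2, 4, 6, 8], base_seed=10000)
make_treasure("test", 200, [14, 16, 18, 20, 22, 25, 28, 30], base_seed=30000)
```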

### E.3 Cooking

#### Task.

The agent must find a recipe posted in a kitchen cookbook and execute it: navigate to relevant rooms, gather the required ingredients, apply the correct preparation steps (opening containers, cutting, cooking), and finally prepare and eat the meal.

#### Generation.

Games are generated with `tw-make tw-cooking`, parametrized as recipe : take : go : flags, where recipe is the number of required ingredients, take is the number of objects to pick up, go is the number of rooms, and the flags encode which mechanics are active: o (openable containers), c (cooking appliance), t (cutting board), d (limited inventory / drop required). Training configurations cycle across four settings ([Table 7](https://arxiv.org/html/2605.11436#A5.T7)). Validation and test configurations cycle across five harder settings (4–5 ingredients, 9–12 rooms).

Table 7: Cooking training configurations.
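The sketch below maps a recipe:take:go:flags configuration onto a `tw-make tw-cooking` invocation. The long-form option names are assumptions based on TextWorld's cooking challenge, not confirmed by the paper; check `tw-make tw-cooking --help` for the exact interface.

```python
# Sketch of a single Cooking game generation call.
import subprocess

# Assumed mapping from the single-letter flags above to CLI options.
FLAG_OPTIONS = {"o": "--open", "c": "--cook", "t": "--cut", "d": "--drop"}

def make_cooking(recipe, take, go, flags, seed, out_path):
    cmd = ["tw-make", "tw-cooking",
           "--recipe", str(recipe), "--take", str(take), "--go", str(go)]
    cmd += [FLAG_OPTIONS[f] for f in flags]  # e.g. "oct" -> --open --cook --cut
    cmd += ["--seed", str(seed), "--output", out_path]
    subprocess.run(cmd, check=True)

# Hypothetical training configuration: 3 ingredients, 2 takes, 6 rooms, o+c+t.
make_cooking(3, 2, 6, "oct", seed=10000, out_path="games/cooking/train_0.z8")
```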

### E.4 ALFWorld

The training and testing datasets for ALFWorld are taken directly from Wang and Ammanabrolu [[2025](https://arxiv.org/html/2605.11436#bib.bib46)].

## Appendix F Implementation Details and Hyperparameters

Our codebase is built on top of Wang and Ammanabrolu [[2025](https://arxiv.org/html/2605.11436#bib.bib46)], and all training is run on a single node with 4 NVIDIA A100 GPUs.

### F.1 Models Used

To train and evaluate Agent-BRACE, we adopt the instruction-tuned models [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) and [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) as our base models. Additionally, we use [Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) as the judge to evaluate the state tracking and correctness rewards.

### F.2 Belief State Model: Supervised Pre-training (SFT)

Before joint RL training, the belief-state LM is fine-tuned with supervised learning on belief-state trajectories generated from teacher demonstrations. Hyperparameters are summarized in [Table 8](https://arxiv.org/html/2605.11436#A6.T8).

Table 8: Belief-state SFT hyperparameters.

### F.3 Policy PPO Training

The policy is trained with Proximal Policy Optimization (PPO) using a Generalized Advantage Estimation (GAE) critic. [Table 9](https://arxiv.org/html/2605.11436#A6.T9) lists the PPO hyperparameters used.

Table 9: Policy PPO hyperparameters (shared across environments).

### F.4 Joint Belief-State PPO Training

During RL, the belief-state LM is updated jointly with the policy after every rollout batch. Belief-model training hyperparameters are given in [Table 10](https://arxiv.org/html/2605.11436#A6.T10).

Table 10: Joint belief-state RL hyperparameters.

## Appendix G Belief States are underconfident but improve over training

[Fig. 5](https://arxiv.org/html/2605.11436#A7.F5) analyzes the calibration of WEP labels at early (steps 0–4) and late (steps 10–15) training stages. For each WEP label emitted by the belief state model, we measure the empirical truth rate, i.e., the fraction of claims carrying that label that are independently verified as true by the LLM judge (Qwen3-30B-A3B-Instruct-2507), and compare it against the nominal probability that the label represents on the WEP scale (e.g., confirmed ≈ 1.0, probable ≈ 0.75, possible ≈ 0.50, doubtful ≈ 0.25, unknown ≈ 0.0; shown in gray). A perfectly calibrated belief model would have these match exactly; a model whose empirical truth rate consistently exceeds the nominal probability is underconfident: it assigns lower-confidence labels to beliefs that are in fact more often true than those labels suggest (e.g., labeling something possible when it is actually true 84% of the time). [Fig. 5](https://arxiv.org/html/2605.11436#A7.F5) shows that this is precisely what occurs: at early training, claims labeled unknown are verified true 68% of the time, far above the nominal probability of approximately 0, shrinking to 54% by late training. This indicates the model increasingly reserves unknown for genuinely uncertain claims rather than using it as a default label. At the high-confidence end, almost certain and confirmed remain well calibrated throughout (≥91% and ≥95%, respectively). In sequential decision making under partial observability, underconfidence is a safer failure mode than overconfidence [Stengel-Eskin et al., [2024](https://arxiv.org/html/2605.11436#bib.bib45)]: an agent that hedges will continue to explore and gather evidence, whereas an overconfident agent risks committing to an incorrect world model and acting on it irreversibly.
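A minimal sketch of this calibration measurement, assuming judged claims have been collected as (label, verdict) pairs; the nominal values for the intermediate labels are assumptions beyond the anchors quoted above.

```python
# Per-label empirical truth rate vs. the label's nominal WEP probability.
from collections import defaultdict

NOMINAL = {"confirmed": 1.0, "almost certain": 0.9, "probable": 0.75,
           "possible": 0.5, "unlikely": 0.3, "doubtful": 0.25, "unknown": 0.0}
# ("almost certain" and "unlikely" values are assumed; the paper specifies
#  the anchors shown in Fig. 5.)

def calibration(claims):
    """claims: iterable of (wep_label, judged_true) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for label, judged_true in claims:
        totals[label] += 1
        hits[label] += int(judged_true)
    return {lab: {"empirical": hits[lab] / totals[lab],
                  "nominal": NOMINAL[lab],
                  # gap > 0 means the label is underconfident
                  "gap": hits[lab] / totals[lab] - NOMINAL[lab]}
            for lab in totals}

print(calibration([("unknown", True), ("unknown", False), ("confirmed", True)]))
```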

![Refer to caption](https://arxiv.org/html/2605.11436v1/x5.png)

Figure 5: Calibration of WEP labels at early (0–4) and late (10–15) steps. For each WEP label emitted by the belief model, the colored bar shows the fraction of those claims independently verified as true, while the grey bar shows the nominal probability that the label represents on the WEP scale. The belief model is systematically underconfident; it assigns conservative labels to beliefs that are more often true than the label implies.

## Appendix H Statistical Reliability of Main Results

To assess the statistical reliability of Agent-BRACE, we report mean accuracy and standard deviation across three independent runs for all methods trained on Quest and evaluated on all three TextWorld environments. [Table 11](https://arxiv.org/html/2605.11436#A8) reports the results for Qwen3-4B-Instruct.

Table 11: Performance comparison of Agent-BRACE against baselines across three TextWorld environments (Quest, Treasure, Cooking) on Qwen3-4B-Instruct, with standard deviation across three independent runs.

| Method | Quest Acc. ↑ | Quest Steps ↓ | Treasure Acc. ↑ | Treasure Steps ↓ | Cooking Acc. ↑ | Cooking Steps ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Base Model | 61.5 ± 0.0 | 32.2 ± 0.1 | 66.0 ± 0.9 | 29.9 ± 0.3 | 69.0 ± 1.4 | 33.9 ± 0.2 |
| ReAct | 61.0 ± 0.6 | 12.8 ± 0.2 | 69.5 ± 0.9 | 10.0 ± 0.2 | 13.2 ± 0.3 | 24.4 ± 0.0 |
| Direct-Action (RL) | 74.0 ± 0.8 | 29.8 ± 0.7 | 72.5 ± 0.3 | 28.1 ± 0.1 | 76.0 ± 0.6 | 31.5 ± 0.3 |
| ReAct (RL) | 75.5 ± 0.6 | 18.3 ± 0.2 | 74.0 ± 0.0 | 16.5 ± 0.0 | 13.0 ± 0.0 | 40.6 ± 0.0 |
| PABU | 83.0 ± 0.9 | 26.4 ± 2.4 | 70.7 ± 3.0 | 37.8 ± 1.4 | 32.5 ± 0.3 | 72.1 ± 3.1 |
| Agent-BRACE | 88.0 ± 0.9 | 30.8 ± 0.5 | 81.0 ± 0.6 | 29.9 ± 0.2 | 68.7 ± 0.6 | 44.7 ± 0.2 |
## Appendix I Belief State Model Training

![Refer to caption](https://arxiv.org/html/2605.11436v1/x6.png)

Figure 6: Reward component trajectories across PPO training steps for the belief state model (Qwen3-4B-Instruct).

[Fig. 6](https://arxiv.org/html/2605.11436#A9.F6) plots the five reward components along with the mean total reward across belief state model training steps. If the SFT cold-start phase were distilling task knowledge from GPT 5.4 mini, we would expect state correctness and total task reward to begin high and plateau quickly, with minimal learning signal during PPO. Instead, we observe the opposite: format compliance starts near 0.98 and remains consistently high throughout training, confirming that SFT successfully bootstrapped adherence to the structured, WEP-annotated belief format. In contrast, state correctness begins near 0.2 and rises steadily to 0.65, the diversity reward rises from 0.3 to 0.65, and the discounted task reward rises from 0.55 to 0.78. This pattern demonstrates that factual belief quality, uncertainty diversity, and task performance are learned through RL interaction with the environment, not inherited from the teacher model. The SFT phase contributes only to the format; all substantive learning is attributed to the RL training signal.

## Appendix J Agent-BRACE without state correctness and tracking rewards

[Table 12](https://arxiv.org/html/2605.11436#A10) ablates the LLM-judged belief state rewards by setting $r_t^{\text{st}}$ and $r_t^{\text{sc}}$ to zero while keeping all other components fixed. Removing both rewards degrades average accuracy by 4 points (79.3% → 75.3%). Notably, on average the ablation still outperforms all baselines reported in [Section 3.2](https://arxiv.org/html/2605.11436#S3.SS2), demonstrating that removing the LLM judge rewards does not catastrophically degrade the performance of Agent-BRACE.

Table 12: Ablation of LLM-judge-based belief state rewards. Agent-BRACE vs. a variant with no state correctness and tracking rewards.

## Appendix K Qualitative Examples

To illustrate the properties of our jointly trained belief state, we present representative examples from our method on cooking tasks.

The first example below illustrates calibrated uncertainty and its correct resolution: after a single open plain door action, the south exit transitions from possible to confirmed, while genuinely unknown quantities such as the cookbook location and the shelf contents remain marked unknown rather than being hallucinated.

The second example demonstrates that the belief state tracks world-state changes without over-committing to unseen information. After opening the sliding patio door, the model correctly upgrades that exit to confirmed open, while the unvisited north and west exits remain possible with "destination not yet observed", reflecting the agent's actual epistemic boundary.

The third example shows fine-grained inventory tracking across multiple items. The belief state simultaneously records five items with exact processing descriptors (chopped burned, sliced burned) while correctly deferring on recipe requirements and counter contents, both marked unknown pending cookbook consultation.

You are playing a text-based game. Given the goal, your previous belief state, and the current observation, produce an updated belief state capturing what you know and how confidently you know it.

Goal: {goal}
Previous belief state: {previous_belief_state}
Current observation: {current_obs}

Output ONLY a belief state within <belief_state> </belief_state> tags.

— STRICT RULES —
Your belief state MUST NOT contain: (i) any next action, plan, or intention; (ii) forward-looking phrases: "I will", "I should", "my next step", "I plan to", etc.; (iii) any recommendation about which command to execute. ONLY record what you have already observed or can directly infer from past observations.

— UPDATE RULES —
- If the current observation CONFIRMS a previous bullet → upgrade it to "confirmed".
- If the current observation CONTRADICTS a previous bullet → replace it immediately.
- Never carry forward a stale bullet that conflicts with a direct observation.

— CERTAINTY SCALE —
Every bullet MUST contain exactly one of these markers, used naturally in the sentence:
- confirmed / certain: directly observed this turn
- almost certain: observed previously, no contradicting evidence since
- probable: inferred from goal structure or strong pattern
- possible: no visit yet, some contextual reason to believe
- unlikely: visited nearby rooms, no supporting evidence found
- doubtful: contradicting evidence exists
- unknown: no evidence

— MANDATORY COVERAGE —
You MUST cover each of the following in at least one bullet: current location; each known exit and where it leads; each goal-relevant object (its location and state); your inventory; progress toward each sub-goal.

— FORMAT —
One bullet per distinct fact, starting with "- ". Every bullet contains exactly one certainty marker. No JSON, no percentages, no key-value pairs.

Examples:
- It is confirmed that I am in the kitchen.
- The east exit almost certainly leads to the hallway based on prior exploration.
- The key is probably still in the living room where I last saw it.
- It is possible the chest in the bedroom contains the goal item, though I have not visited.
- The couch is ruled out from the bedroom – I visited and did not observe it there.
- It is doubtful the garden door is unlocked given every other door here has been locked.
- It is confirmed that I am carrying only the brass key.
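For illustration, a minimal sketch of filling this template and extracting the tagged belief from the model's output; `generate` stands in for any LLM call (e.g. the fine-tuned belief model) and is not the paper's API, and the prompt constant is abbreviated.

```python
# Fill the belief prompt, call a model, and pull out the <belief_state> block.
import re

PROMPT = (
    "You are playing a text-based game. Given the goal, your previous belief "
    "state, and the current observation, produce an updated belief state...\n"
    "Goal: {goal}\nPrevious belief state: {previous_belief_state}\n"
    "Current observation: {current_obs}\n"
    "Output ONLY a belief state within <belief_state> </belief_state> tags."
)

def update_belief(generate, goal, prev_belief, obs):
    text = generate(PROMPT.format(goal=goal,
                                  previous_belief_state=prev_belief,
                                  current_obs=obs))
    m = re.search(r"<belief_state>(.*?)</belief_state>", text, re.DOTALL)
    return m.group(1).strip() if m else prev_belief  # fall back to old belief
```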

You are evaluating a belief-state update in a text-based game.

Previous belief state: {prev_belief}
New observation: {new_obs}
New belief state: {new_belief}

Step 1 — Identify (brief, one line each):
- New facts: list each distinct fact the observation reveals (e.g. "player moved to kitchen", "door is open")
- Missing: which of those new facts are absent or wrong in the new belief state
- Stale: which prior beliefs does the observation contradict that were left unchanged

Step 2 — Count (integers):
- $N_{\text{new}}$ = number of new facts correctly captured
- $N_{\text{missing}}$ = number of new facts missing or wrong
- $N_{\text{stale}}$ = number of stale/contradicted priors left unchanged
- $N_{\text{total}}$ = total claims in the new belief state

Step 3 — Compute:
- If $N_{\text{total}} = 0$: score = 0.0
- Otherwise: coverage $= \frac{N_{\text{new}}}{\max(1,\, N_{\text{new}} + N_{\text{missing}})}$, staleness $= \frac{N_{\text{stale}}}{N_{\text{total}}}$, score $= \text{coverage} \times (1 - \text{staleness})$, clamped to $[0.00, 1.00]$

End with exactly <score>X.XX</score> where X.XX is a decimal in $[0.00, 1.00]$.
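Since both judge prompts terminate with a `<score>` tag, their outputs can be parsed with a small helper like the following sketch (helper name illustrative).

```python
# Extract and clamp the final <score>X.XX</score> verdict from judge output.
import re

def parse_judge_score(judge_output: str) -> float:
    """Return the last <score> value, clamped to [0, 1]; 0.0 if missing."""
    matches = re.findall(r"<score>\s*([0-9]*\.?[0-9]+)\s*</score>", judge_output)
    if not matches:
        return 0.0
    return min(1.0, max(0.0, float(matches[-1])))

assert parse_judge_score("verdicts...\n<score>0.75</score>") == 0.75
assert parse_judge_score("no tag emitted") == 0.0
```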

You are analysing a belief state from a text-based game.

Belief state: {belief_state}

Task — Extract every specific factual claim from the belief state. List each claim on its own line using this exact format:
CLAIM: <subject> | <predicate> | <certainty-label>

Examples:
- CLAIM: player location | in the kitchen | certain
- CLAIM: key | on the table in the library | probable
- CLAIM: east exit from kitchen | leads to hallway | almost certain
- CLAIM: chest | open | possible

List ALL claims now (one per line):

You are verifying factual claims from a belief state against the true game world state.

True game world state (ground truth): {raw_state}
Claims to verify: {claims}

Instructions — For each claim decide:
- Correct — the underlying fact is true *and* the certainty label is appropriate.
- Incorrect — the underlying fact is false (label does not matter).
- Partially correct — the fact is true but the certainty label is badly miscalibrated (e.g. marked certain when the evidence only supports probable, or unknown for something directly observable in the true state).
- Unverifiable — the ground truth does not contain enough information to confirm or deny the claim.

Scoring — Let:
- $N_{\text{verifiable}} = \text{Correct} + \text{Incorrect} + \text{Partially correct}$
- $N_{\text{correct}}$ = number of Correct verdicts
- $N_{\text{partial}}$ = number of Partially correct verdicts

$\text{score} = \frac{N_{\text{correct}} + 0.5 \times N_{\text{partial}}}{N_{\text{verifiable}}}$ (with $N_{\text{verifiable}} = 0 \Rightarrow \text{score} = 0.0$)

Provide a brief per-claim verdict, then end with exactly <score>X.XX</score> where X.XX is a decimal in $[0.00, 1.00]$.

You are playing a text-based game. Given the goal, your previous belief state, and the current observation, produce an updated belief state capturing what you know and how confidently you know it.

Goal: {goal}
Previous belief state: {previous_belief_state}
Current observation: {current_obs}

Output ONLY a belief state within <belief_state> </belief_state> tags.

— STRICT RULES —
Your belief state MUST NOT contain: (i) any next action, plan, or intention; (ii) forward-looking phrases: "I will", "I should", "my next step", "I plan to", etc.; (iii) any recommendation about which command to execute. ONLY record what you have already observed or can directly infer from past observations.

— UPDATE RULES —
- If the current observation CONFIRMS a previous bullet → upgrade it to "confirmed".
- If the current observation CONTRADICTS a previous bullet → replace it immediately.
- If an expected element is not observed when new information is obtained → downgrade its likelihood.
- Never carry forward a stale bullet that conflicts with a direct observation.

— CERTAINTY SCALE —
Every bullet MUST contain exactly one of these markers:
- confirmed / certain: directly observed this turn
- almost certain: observed previously, no contradicting evidence since
- probable: inferred from prior observations or consistent patterns
- possible: no direct evidence either way
- unlikely: weak or missing supporting evidence
- doubtful: contradicting evidence exists
- unknown: no evidence

— FORMAT —
One bullet per distinct fact, starting with "- ". Every bullet contains exactly one certainty marker from the scale above. No JSON, no percentages, no key-value pairs.

Examples:
- It is confirmed that I am in a situation where a surface and an object are present.
- An element previously observed is almost certainly still present as there is no contradicting evidence.
- There is a probable relationship between two elements based on prior observations.
- It is possible that additional elements exist that have not yet been observed.
- An expected element is ruled out in the current context as it is not observed.
- It is doubtful that a previously assumed property holds given new evidence.
- It is confirmed that I am associated with an element obtained earlier.

Two consecutive steps as the agent opens a door in the pantry (game cooking_30003).

Step A — Observation: You are now in the pantry. A normal one. What's that over there? It looks like it's a shelf. The shelf is wooden. However, the shelf, like an empty shelf, has nothing on it. There is a closed plain door leading south. You are carrying nothing.

Belief state:
- It is confirmed that I am in the pantry.
- The south exit is possible through the closed plain door, as indicated by the door description.
- The shelf is probable in the pantry, currently empty as observed.
- Whether the shelf contains any items is unknown.
- It is confirmed that I am carrying nothing.
- The location of the cookbook is unknown, as it has not been observed in the pantry.
- Whether the recipe is accessible from here is unknown.

Action: open plain door ✓

Step B — Observation: You are now in the pantry. You open plain door. You are carrying nothing.

Belief state:
- It is confirmed that I am in the pantry.
- The south exit is confirmed by the open plain door leading south.
- The shelf is probable in the pantry, currently empty as observed.
- Whether the shelf contains any items is unknown.
- It is confirmed that I am carrying nothing.
- The location of the cookbook is unknown, as it has not been observed here.
- It is confirmed that the plain door is open, as directly observed.

Action: go south ✓

The agent navigates through a corridor with a sliding patio door (game cooking_30002).

Step A — Observation: You are now in the corridor. A normal one. There is a closed sliding patio door leading east. There is an exit to the north. Don't worry, there is no door. You need an exit without a door? You should try going south. There is an exit to the west. You are carrying nothing.

Belief state:
- It is confirmed that I am in the corridor.
- The east exit is possible through a closed sliding patio door, though it is currently closed and not open.
- Whether the sliding patio door can be opened is unknown, as it has not yet been interacted with.
- The north exit is possible, destination not yet observed.
- The west exit is possible, destination not yet observed.
- It is confirmed that I am carrying nothing.

Action: open sliding patio door ✓

Step B — Observation: You are now in the corridor. You open sliding patio door. You are carrying nothing.

Belief state:
- It is confirmed that I am in the corridor.
- The east exit is possible through an open sliding patio door, since it is explicitly open.
- The sliding patio door is confirmed open, as directly observed after opening it.
- The north and west exits are possible, destinations not yet observed.
- It is confirmed that I am carrying nothing.

Action: go east ✓

After several cooking steps, the agent holds five distinct items (game cooking_30003).

Observation: You are now in the kitchen. You take the red apple from the counter. You are carrying: a red apple, a knife, a chopped burned red hot pepper, a sliced burned yellow potato and a cookbook.

Belief state:
- It is confirmed that I am in the kitchen.
- The north open plain door is confirmed.
- The red apple is confirmed in my inventory, taken from the counter.
- The chopped burned red hot pepper is confirmed in my inventory.
- The sliced burned yellow potato is confirmed in my inventory.
- The cookbook is confirmed in my inventory.
- The knife is confirmed in my inventory.
- Whether any items remain on the counter is unknown.
- The required preparation state of the red apple is unknown, as the cookbook has not yet been read.

Action: slice red apple with knife ✓

Similar Articles

Belief Memory: Agent Memory Under Partial Observability

arXiv cs.AI

This paper introduces BeliefMem, a novel memory paradigm for LLM agents that stores multiple candidate conclusions with probabilities to handle partial observability and reduce self-reinforcing errors. Empirical evaluations show it outperforms deterministic baselines on LoCoMo and ALFWorld benchmarks.

BALAR : A Bayesian Agentic Loop for Active Reasoning

arXiv cs.AI

This paper introduces BALAR, a training-free Bayesian agentic loop algorithm that enables large language models to actively reason and ask clarifying questions in multi-turn interactions. It demonstrates significant performance improvements over baselines on detective, puzzle, and clinical diagnosis benchmarks.

Preregistered Belief Revision Contracts

arXiv cs.CL

This paper introduces Preregistered Belief Revision Contracts (PBRC), a protocol-level mechanism for multi-agent systems (including LLM-based agents) that separates open communication from admissible belief changes by publicly fixing evidence triggers and revision operators. The work addresses dangerous conformity effects in agent deliberation and provides formal guarantees that social-only pressure cannot drive false consensus.

Milestone-Guided Policy Learning for Long-Horizon Language Agents

arXiv cs.CL

This paper introduces BEACON, a milestone-guided policy learning framework designed to improve credit assignment and sample efficiency for long-horizon language agents. It demonstrates significant performance improvements over GRPO and GiGPO on benchmarks like ALFWorld, WebShop, and ScienceWorld.

PolicyBank: Evolving Policy Understanding for LLM Agents

arXiv cs.CL

PolicyBank proposes a memory mechanism that enables LLM agents to autonomously refine their understanding of organizational policies through iterative interaction and corrective feedback, closing specification gaps that cause systematic behavioral divergence from true requirements. The work introduces a systematic testbed and demonstrates PolicyBank can close up to 82% of policy-gap alignment failures, significantly outperforming existing memory mechanisms.