Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation
Summary
The paper introduces the Belief Engine, an auditable belief-update layer for LLM agents that makes stance changes in multi-agent deliberation configurable and inspectable by treating belief as an evidential state with explicit update rules.
# Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation
Source: [https://arxiv.org/html/2605.15343](https://arxiv.org/html/2605.15343)
###### Abstract
LLM-based agents are increasingly used to simulate deliberative interactions such as negotiation, conflict resolution, and multi-turn opinion exchange. Yet generated transcripts often do not reveal why an agent’s stance changes: movement may reflect evidence uptake, anchoring, role drift, echoing, or changed prompt and retrieval context. We introduce the Belief Engine (BE), an auditable belief-update layer that treats “belief” as an evidential state over a proposition and exposes it as scalar stance. BE extracts arguments into structured memory and updates stance with a log-odds rule controlled by evidence uptake $u$ and prior anchoring $a$. Across multiple base LLMs, parameter sweeps show that these controls reliably shape stance dynamics while preserving an evidence-level update trail. On DEBATE, a human deliberation dataset with pre/post opinions, BE best reconstructs participants whose final stance follows extracted evidence; stable and evidence-opposed cases instead point to anchoring or factors outside the extracted evidence stream. BE provides configurable infrastructure for studying evidence-grounded deliberation, where openness, commitment, convergence, and disagreement can be tied to explicit update assumptions rather than hidden prompt effects.
Keywords: LLM agents, deliberation, belief updating, stance dynamics
Figure 1: Belief Engine architecture. Incoming messages are extracted into structured arguments, judged, and stored as active evidence or archived records. Active evidence updates the maintained belief state through evidence uptake and prior anchoring; responses are generated from the resulting stance plus retrieved memory and recent dialogue context.

## 1 Introduction
Large Language Models (LLMs) are moving from single-turn assistants to simulated participants in agent societies, education, civic deliberation, and collective-decision settings. Generative-agent work showed how LLMs can sustain social simulation over time (Park et al., [2023](https://arxiv.org/html/2605.15343#bib.bib31)), while recent deliberation and augmented-democracy studies use LLMs to model discussion, preference formation, and citizen input (Yang et al., [2024](https://arxiv.org/html/2605.15343#bib.bib65); Chuang et al., [2025](https://arxiv.org/html/2605.15343#bib.bib6); Gudiño et al., [2024](https://arxiv.org/html/2605.15343#bib.bib66)). Digital-twin proposals extend this ambition by treating synthetic communities as test beds for deliberative design (Novelli et al., [2025](https://arxiv.org/html/2605.15343#bib.bib77)). Across these settings, researchers need to know what agents say, and also whether, when, and why their stances change in response to reasons. In multi-agent deliberation, these update dynamics matter for cooperation and conflict resolution because they shape when agents listen, preserve commitments, converge, or remain apart.
Because current LLM agents can generate fluent agreement and disagreement without exposing the source of that movement, attribution becomes central. A transcript may look more moderate because the agent accepted evidence, because its persona drifted (Choi et al., [2025b](https://arxiv.org/html/2605.15343#bib.bib41)), because it echoed its interaction partner (Shekkizhar et al., [2025](https://arxiv.org/html/2605.15343#bib.bib42)), or because inherited model biases shaped the response (Taubenfeld et al., [2024](https://arxiv.org/html/2605.15343#bib.bib4)). An agent may also appear resistant because retrieval or prompt context changed. The underlying issue is that temporal continuity is not automatic in a standard LLM interaction: without an external memory and update rule, each turn is a new computation over the provided context rather than a persistent deliberative agent carrying experience forward. For agent-based simulations, the modelling target becomes unclear: what kind of deliberative subject is being represented?
We introduce the Belief Engine (BE), an auditable simulation-control layer that puts log-odds belief updating under experimental control. We use “belief” in an operational Bayesian modelling sense: a proposition-level evidential state maintained in log-odds. We use “stance” for the scalar readout $S \in [-1,1]$, with positive values supporting the proposition and negative values opposing it. Initial seed arguments define the prior, later arguments provide weighted evidence, and both are accumulated through a log-odds rule with two interpretable controls: evidence uptake $u$ and prior anchoring $a$. This targets the component of deliberation that current LLM agents leave implicit: what counts as evidence, how much it can move the belief state, and how strongly an initial commitment persists. Incoming arguments are extracted, judged, and stored in structured memory before they enter this update. Language generation is conditioned on the updated stance and retrieved evidence.
Separating the maintained belief state from generation lets researchers state modelling assumptions that are usually buried in prompts\. High uptake creates more evidence\-responsive agents, strong anchoring preserves initial commitments, and the stance trajectory can be checked against the evidence that produced it\. BE therefore turns deliberation simulation from prompt\-driven transcript production into a reportable state\-transition model: researchers can specify how evidence enters memory, how much it moves the belief state, and how the resulting stance trajectory conditions language\.
We evaluate BE with controlled parameter sweeps, matched prompt-based baselines, and replay on 2,495 quality-filtered DEBATE (Chuang et al., [2025](https://arxiv.org/html/2605.15343#bib.bib6)) trajectories from human multi-round discussions. The sweeps test whether $u$ and $a$ control evidence responsiveness across base models. The prompt baselines test whether self-revision and retrieval expose the same update process. We use DEBATE replay to understand which kinds of human stance movement are explained by extracted evidence, rather than to claim that one profile should predict the whole population: under the same extracted-evidence stream, which uptake–anchoring profiles reconstruct observed response regimes?
Concretely, BE makes three pieces of the simulation explicit\. The architecture separates argument extraction, evidence judgement, structured memory, belief updating, stance computation, and response generation\. The generated\-agent experiments show that two scalar controls, uptake and anchoring, produce predictable stance dynamics across multiple base models while preserving an evidence\-level audit trail\. The human replay protocol shows that the same interface can separate evidence\-explained movement, stable anchoring, and movement whose signal is absent from the extracted evidence stream\.
## 2 Related work
LLM deliberators blur expression and state. A central difficulty in LLM-based social simulation is attribution: when an LLM agent changes what it says, the transcript alone does not show whether the movement should be treated as evidence-responsive updating or as a surface shift caused by prompting, retrieval, or generation. LLM agents exhibit persona shift (Choi et al., [2025b](https://arxiv.org/html/2605.15343#bib.bib41); Shekkizhar et al., [2025](https://arxiv.org/html/2605.15343#bib.bib42)), knowledge-conflict failures (Xu et al., [2024](https://arxiv.org/html/2605.15343#bib.bib1)), recency sensitivity (Kim et al., [2024](https://arxiv.org/html/2605.15343#bib.bib2); Zhang et al., [2025](https://arxiv.org/html/2605.15343#bib.bib5)), regression to training biases (Taubenfeld et al., [2024](https://arxiv.org/html/2605.15343#bib.bib4)), and excessive convergence relative to humans (Chuang et al., [2025](https://arxiv.org/html/2605.15343#bib.bib6)). DEBATE is especially relevant because it records both public interaction traces and private pre/post opinions, showing that plausible role-play can still distort individual and group opinion dynamics. Debate can also behave like a martingale without a specified update policy (Choi et al., [2025a](https://arxiv.org/html/2605.15343#bib.bib17)). Even if reasoning models internally simulate dialogic “societies of thought” (Kim et al., [2026](https://arxiv.org/html/2605.15343#bib.bib12)), those implicit perspectives are not persistent, auditable social actors. BE addresses this gap by maintaining a separate belief state with an explicit stance variable.
Democratic simulations need inspectable change. Applied systems increasingly use LLMs in deliberative settings: AI debate can support factual claim assessment (Rahman et al., [2025](https://arxiv.org/html/2605.15343#bib.bib74)) and judicial tasks (Hu et al., [2025](https://arxiv.org/html/2605.15343#bib.bib76)), while Agora and ArgueMate scaffold consensus-finding in civic and educational contexts (Pradeep Fulay et al., [2026](https://arxiv.org/html/2605.15343#bib.bib73); Wang et al., [2026](https://arxiv.org/html/2605.15343#bib.bib75)). Augmented-democracy and digital-twin proposals extend this ambition by using LLMs or synthetic communities to estimate citizen preferences and test deliberative designs through controlled “what-if” scenarios (Gudiño et al., [2024](https://arxiv.org/html/2605.15343#bib.bib66); Novelli et al., [2025](https://arxiv.org/html/2605.15343#bib.bib77)), and DelibSim shows that LLM groups match human procedural discourse quality while failing to reproduce similar epistemic outcome dynamics (Flechtner, [2026](https://arxiv.org/html/2605.15343#bib.bib72)). We argue that, in such settings, the missing state variable becomes a substantive problem. A deliberative simulator should report final accuracy, epistemic and discourse quality, and user learning, but it should also expose why simulated participants move, remain stable, or polarise.
Memory does not determine belief revision. Agent memory systems decide what can be retained, reflected on, or retrieved: Generative Agents (Park et al., [2023](https://arxiv.org/html/2605.15343#bib.bib31)), episodic memory banks (Zhong et al., [2023](https://arxiv.org/html/2605.15343#bib.bib36)), reflection systems (Xu et al., [2025](https://arxiv.org/html/2605.15343#bib.bib47); Tan et al., [2025](https://arxiv.org/html/2605.15343#bib.bib53)), graph memory (Gutiérrez et al., [2024](https://arxiv.org/html/2605.15343#bib.bib37); Kang et al., [2025](https://arxiv.org/html/2605.15343#bib.bib8); Huang et al., [2025](https://arxiv.org/html/2605.15343#bib.bib9)), large-scale simulations (Piao et al., [2026](https://arxiv.org/html/2605.15343#bib.bib13); Xu et al., [2026](https://arxiv.org/html/2605.15343#bib.bib14)), and retrieval-augmented debate (Li et al., [2026](https://arxiv.org/html/2605.15343#bib.bib16)) all strengthen some form of context persistence. But remembering an argument is not the same as accepting it as evidence. Memory can also amplify noise or reinforce experience-following behaviour (Xiong et al., [2025](https://arxiv.org/html/2605.15343#bib.bib10)). BE places judgement and updating between memory and generation: stored arguments become belief-relevant only when the evidence layer marks them active.
Formal update rules need semantic grounding. Bayesian models (Olsson, [2013](https://arxiv.org/html/2605.15343#bib.bib22)), DeGroot updating (DeGroot, [1974](https://arxiv.org/html/2605.15343#bib.bib26)), and bounded-confidence models (Deffuant et al., [2000](https://arxiv.org/html/2605.15343#bib.bib28); Hegselmann and Krause, [2002](https://arxiv.org/html/2605.15343#bib.bib27)) make opinion dynamics explicit, and recent work connects LLM debate to Bayesian Nash equilibrium (Xie et al., [2025](https://arxiv.org/html/2605.15343#bib.bib18)). Their abstraction is also their limitation for language-agent simulations: they rarely specify how natural-language arguments are extracted, judged, remembered, or turned back into utterances. Human belief updating also departs from ideal Bayesian assumptions (Stengård et al., [2022](https://arxiv.org/html/2605.15343#bib.bib19); Holt and Smith, [2009](https://arxiv.org/html/2605.15343#bib.bib20); Ashinoff et al., [2022](https://arxiv.org/html/2605.15343#bib.bib21)). People may protect commitments, weigh evidence through identity and trust, and respond differently across social contexts. The BE framework sits between these traditions. It keeps the update rule parameterised, grounds each update in extracted arguments, and exposes profiles that can be calibrated against human deliberation trajectories.
## 3 Method
Figure [1](https://arxiv.org/html/2605.15343#S0.F1) depicts a BE agent as an LLM-based debater whose proposition-level belief state is maintained by the Belief Engine. During a debate it exchanges messages with an opponent on a fixed topic. The agent can use any compatible base model. Across the generated-agent comparisons we use GPT-4o-mini, GPT-5.4-mini, Qwen 3.5 9B, and Gemma 4 E4B in different roles, as detailed below. In the generated-agent conditions, debates start from $n = 10$ seeded arguments and use the same dialogue protocol while varying the uptake-anchoring profile $(u, a)$. Each response is generated from three inputs: (i) current stance instruction, (ii) retrieved context, and (iii) recent dialogue context. Since the belief state is updated by reported parameters and exposed as $S$, the architecture makes it possible to study how simulated agents change state under argumentative pressure.
### 3.1 Belief engine
The Belief Engine decouples belief maintenance from generative reasoning so that the agents’ belief updating can be controlled systematically. For each new message, the engine follows the five-step loop (Fig. [1](https://arxiv.org/html/2605.15343#S0.F1)): Extract Arguments (convert a message into candidate claims), Judge Evidence (score and merge near-duplicate candidates), Update Structured Memory (store active and archived evidence), Update Belief State (apply the log-odds rule and compute stance $S$), and Compose Response (retrieve active memory in proportion to its pro/con composition and map $S$ to behavioural instructions for generation).
The reported Belief Engine results use a deterministic log-odds updater that recomputes the belief state from argument polarity, support strength, evidence uptake $u$, and prior anchoring $a$, then exposes it as stance $S$. Each stored argument has a binary active flag $z_i \in \{0,1\}$, implemented as `credence_relevant`. Only active records contribute to stance $S \in [-1,1]$, separating record-level evidence selection from the aggregate stance.
#### Extract arguments\.
An independent LLM module decomposes each message $m_t$ into candidate records $\tilde{\mathcal{E}}_t = J(m_t) = \{\tilde{e}_i = (c_i, p_i, r_i)\}_{i=1}^{n_t}$, where $c_i$ is the canonical claim, $p_i \in \{-1, +1\}$ is its polarity ($+1$ affirmative, $-1$ negative), and $r_i \in \{\texttt{seed}, \texttt{self}, \texttt{opponent}\}$ is the source role. We use the same extractor across experiments, while the base debater model varies in generated-agent comparisons. Polarity is defined relative to the debate proposition, not sentiment. A single message may yield zero, one, or several candidate arguments. Extraction only proposes candidate evidence records; it does not by itself determine whether a claim should affect the belief state. Rather than restricting extraction to formal, logically structured arguments, we adopt a broader conception of deliberative communication that includes narratives, rhetorical forms, and experiential knowledge (Young, [2002](https://arxiv.org/html/2605.15343#bib.bib67); Nakazawa et al., [2024](https://arxiv.org/html/2605.15343#bib.bib71)). Accordingly, any persuasive contribution, whether formal reasoning, anecdotal evidence, or rhetorical framing, is treated as a valid unit of analysis.
#### Judge evidence\.
The judgement layer governs how candidate arguments are transformed into structured evidence. Following Xiong et al. ([2025](https://arxiv.org/html/2605.15343#bib.bib10)), we treat scoring and conflict resolution as separate architectural choices with explicit outputs.

Strength scoring. Each extracted argument receives a strength $s \in [0,1]$ from either the LLM extractor ($s_i^{\mathrm{LLM}}$) or a classifier ($s_i^{\mathrm{clf}}$). The classifier is DeBERTa-v3-large (He et al., [2021](https://arxiv.org/html/2605.15343#bib.bib23)) fine-tuned as a regressor on crowd-rated argument-quality labels (Gretz et al., [2019](https://arxiv.org/html/2605.15343#bib.bib38)), mapping each (topic, claim) pair to a scalar score in $[0,1]$. We set $s_i = s_i^{\mathrm{clf}}$ when classifier scoring is enabled and $s_i = s_i^{\mathrm{LLM}}$ otherwise. Sweeps may use either scorer, while DEBATE replay fixes $s_i = s_i^{\mathrm{clf}}$ so calibration varies only belief-update parameters. Each record is then written as $e_i = (c_i, p_i, s_i, r_i, z_i)$ with $z_i \leftarrow 1$ prior to conflict resolution.

Conflict resolution. To avoid repeated paraphrases causing repeated updates, we soft-deduplicate same-polarity active claims. For a new argument, we compare its claim embedding against active records with the same polarity. If none exist, or if the nearest match has cosine similarity below the configured threshold $\theta$, the new record stays active. Otherwise, we keep only the stronger record active and archive the other. Records are retained for auditability, but only active ones ($z = 1$) affect the belief state. Appendix [B](https://arxiv.org/html/2605.15343#A2) gives the replacement rule and reports how $\theta$ is set.
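The soft-deduplication step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary layout, the function names `cosine` and `resolve_conflict`, and the tie-breaking choice (an equally strong existing record wins) are assumptions made for illustration, and the threshold value must come from the caller because the text defers it to Appendix B.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def resolve_conflict(new, memory, theta):
    """Add `new` to `memory`, archiving the weaker of two near-duplicates.

    Records are dicts with keys: polarity (+1/-1), strength (0..1),
    embedding (list of floats), and active (bool, set here for `new`).
    Returns True if `new` ends up active.
    """
    new["active"] = True
    # Only active records with the same polarity can be duplicates of `new`.
    rivals = [r for r in memory if r["active"] and r["polarity"] == new["polarity"]]
    if rivals:
        nearest = max(rivals, key=lambda r: cosine(r["embedding"], new["embedding"]))
        if cosine(nearest["embedding"], new["embedding"]) >= theta:
            # Near-duplicate: keep only the stronger record active.
            # Favouring the existing record on ties is an assumption.
            if new["strength"] > nearest["strength"]:
                nearest["active"] = False
            else:
                new["active"] = False
    memory.append(new)  # archived records stay in memory for auditability
    return new["active"]
```

Because archived records are kept in the list with `active` set to `False`, the full history remains inspectable while only active records would feed the belief update.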
#### Update structured memory\.
After conflict resolution, the BE agent stores each claim as a structured `ArgumentRecord` with polarity, strength estimates, source role, and `credence_relevant` status. Accepted evidence remains active, while superseded evidence is archived for traceability. This selective consolidation supports record-level replacement during conflict resolution and later retrieval by active pro/con memory composition.
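One way to picture such a record is the Python dataclass below. Only the names `ArgumentRecord` and `credence_relevant` come from the text; the remaining field names, the single `strength` field (the text mentions strength estimates in the plural), the validation checks, and the helper `split_memory` are hypothetical choices for this sketch.

```python
from dataclasses import dataclass
from typing import List, Literal, Tuple

@dataclass
class ArgumentRecord:
    claim: str                                        # canonical claim c_i
    polarity: int                                     # p_i: +1 affirmative, -1 negative
    strength: float                                   # s_i in [0, 1]
    source_role: Literal["seed", "self", "opponent"]  # r_i
    credence_relevant: bool = True                    # z_i: active flag

    def __post_init__(self) -> None:
        # Validation is an illustrative assumption, not part of the described system.
        if self.polarity not in (-1, 1):
            raise ValueError("polarity must be +1 or -1")
        if not 0.0 <= self.strength <= 1.0:
            raise ValueError("strength must lie in [0, 1]")

def split_memory(records: List[ArgumentRecord]) -> Tuple[List[ArgumentRecord], List[ArgumentRecord]]:
    """Separate active evidence from archived (superseded) records."""
    active = [r for r in records if r.credence_relevant]
    archived = [r for r in records if not r.credence_relevant]
    return active, archived
```

Keeping the active flag on the record, rather than deleting superseded entries, mirrors the traceability goal described above.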
#### Update belief state\.
The Belief Engine exposes an uptake-anchoring (UA) profile $(u, a)$: evidence uptake $u$ determines how strongly new evidence shifts the belief state, while prior anchoring $a$ sets the strength of the initial prior.
Belief updating is modelled as additive evidence accumulation in log-odds space over the currently active memory, with bounded stance $S_t$. Let $\mathcal{A}_t = \{i : z_i = 1\}$ denote the active-memory index set after judgement and conflict resolution at time $t$. Rather than retaining stale contributions from arguments that have since been archived, the log-odds state is recomputed from the active set. With $\gamma_i = a$ for seed records ($r_i = \texttt{seed}$) and $\gamma_i = u$ for later debate records ($r_i \neq \texttt{seed}$), the update is
$$
\begin{aligned}
L_t &= \sum_{i \in \mathcal{A}_t} p_i \ln\!\left(1 + s_i\,\gamma_i\right), \\
S_t &= 2\sigma(L_t) - 1 = \frac{2}{1 + \exp(-L_t)} - 1, \quad S_t \in [-1, 1].
\end{aligned}
\tag{1}
$$

Seed records represent the prior, with no separate intercept: $a$ sets how strongly seed records establish the initial belief state, while $u$ sets how strongly later debate evidence shifts it. Evidence accumulates linearly in log-odds, while the logistic transform yields bounded stance and diminishing returns near certainty. Here, “Bayesian” refers to the log-odds evidence-accumulation form: the rule accumulates argument weights as additive evidence over prior seed records, with likelihood-like quantities operationalised through argument extraction and evidence scoring. If records are never archived or replaced, Eq. [1](https://arxiv.org/html/2605.15343#S3.E1) reduces to an incremental update. Archived records remain auditable but no longer affect the belief state.
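To make the update rule concrete, the following Python sketch recomputes the log-odds state from the active records and maps it to stance. The tuple layout and the function names `log_odds` and `stance` are assumptions for illustration; the arithmetic follows Eq. 1, using the exact identity $2\sigma(L) - 1 = \tanh(L/2)$ for numerical stability.

```python
import math

def log_odds(records, u, a):
    """Recompute L_t from scratch over the active records.

    Each record is a tuple (polarity, strength, role, active), with
    polarity in {+1, -1}, strength in [0, 1], role in
    {"seed", "self", "opponent"}, and active a bool (the flag z_i).
    """
    total = 0.0
    for polarity, strength, role, active in records:
        if not active:
            continue  # archived records no longer affect the belief state
        gamma = a if role == "seed" else u  # anchoring for seeds, uptake otherwise
        total += polarity * math.log1p(strength * gamma)  # p_i * ln(1 + s_i * gamma_i)
    return total

def stance(L):
    """Map log-odds to stance: 2*sigmoid(L) - 1, written as tanh(L/2) to avoid overflow."""
    return math.tanh(L / 2.0)

# A single affirmative seed with s = 1 under a = 1 gives L = ln 2.
seed_only = [(+1, 1.0, "seed", True)]
S_seed = stance(log_odds(seed_only, u=1.0, a=1.0))

# An equally strong opposing argument under u = 1 cancels it exactly.
with_rebuttal = seed_only + [(-1, 1.0, "opponent", True)]
S_balanced = stance(log_odds(with_rebuttal, u=1.0, a=1.0))
```

In this toy case the seed alone yields $S = 1/3$, since $\sigma(\ln 2) = 2/3$, and the balanced case yields $S = 0$. Lowering $u$ shrinks the rebuttal's weight $\ln(1 + u)$, so the stance stays closer to the seeded prior.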
#### Compose response\.
After updating the belief state, the BE agent retrieves a bounded context $R_t$ from active memory. Let $\mathcal{M}_t^{+}$ and $\mathcal{M}_t^{-}$ denote the active affirmative and negative records. Retrieval allocates $k_{+} = \operatorname{round}\!\left(k\,|\mathcal{M}_t^{+}| / (|\mathcal{M}_t^{+}| + |\mathcal{M}_t^{-}|)\right)$ slots to affirmative evidence and $k_{-} = k - k_{+}$ to negative evidence, with an even split when active memory is empty. $R_t$ is the union of the strongest $k_{+}$ and $k_{-}$ active records by support strength. Retrieval therefore follows accepted evidence instead of directly amplifying scalar stance. The next utterance is generated as $y_t \sim \mathrm{LLM}(S_t, R_t, h_t)$, where $S_t$ is mapped to a 10-bin stance-intensity instruction and $h_t$ is recent dialogue context.
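The slot allocation and stance binning can be sketched in Python as follows. Several details are assumptions rather than facts from the text: the rounding mode (round half up here), how an odd $k$ is split when memory is empty, the use of uniform bins over $[-1, 1]$, and all function names.

```python
import math

def allocate_slots(k, n_pro, n_con):
    """Split k retrieval slots in proportion to active pro/con record counts."""
    total = n_pro + n_con
    if total == 0:
        k_pro = k // 2  # even split; for odd k the extra slot goes to con (assumption)
    else:
        k_pro = math.floor(k * n_pro / total + 0.5)  # round half up (assumption)
    return k_pro, k - k_pro

def retrieve(active, k):
    """Return the strongest records per side; `active` holds (claim, polarity, strength)."""
    pro = sorted((r for r in active if r[1] > 0), key=lambda r: r[2], reverse=True)
    con = sorted((r for r in active if r[1] < 0), key=lambda r: r[2], reverse=True)
    k_pro, k_con = allocate_slots(k, len(pro), len(con))
    # Slicing simply returns fewer records if a side has less than its slot count.
    return pro[:k_pro] + con[:k_con]

def stance_bin(S, n_bins=10):
    """Index of the stance-intensity bin for S in [-1, 1], assuming uniform bins."""
    index = int((S + 1.0) / 2.0 * n_bins)
    return min(max(index, 0), n_bins - 1)  # clamp so S = +1 falls in the top bin
```

Allocating by record counts rather than by the scalar stance means the retrieved context reflects what evidence is currently accepted on each side.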


Figure 2: Parameter control and profile dynamics. Stance $S \in [-1,1]$, where $+1$ means affirmative/pro with respect to the proposition and $-1$ means negative/con. Left: BE single-agent sweeps across three base LLMs, varying evidence uptake $u$ ($a = 0.70$) and prior anchoring $a$ ($u = 0.4$). Right: GPT-5.4-mini two-agent profile debates with Open-minded $(u, a) = (0.40, 0.20)$ and Stubborn $(0.10, 0.80)$ agents. Thin lines show trials, thick lines show means.
### 3.2 Experimental setup
We separate two empirical targets\. Generated\-agent experiments evaluate the full loop, including generation, judgement, memory, retrieval, and belief updating\. DEBATE replay isolates the update rule under real received\-evidence histories, giving a human\-grounded calibration test for the resulting stance dynamics\.
#### Generated\-agent experiments\.
Seed and counter-agent arguments come from the Argument Quality Ranking dataset (Gretz et al., [2019](https://arxiv.org/html/2605.15343#bib.bib38)), which contains about 30,000 crowd-annotated arguments with topic, polarity, and quality scores. All generated debates run for 15 rounds with $n = 10$ seed arguments per side and agent temperature $T = 0.7$. Symmetric two-agent debate provides a controlled setting for testing update mechanics, profile dynamics, and evidence-conditioned generation.
In the main text, we focus on (i) five-point parameter sweeps over $u$ or $a$ across three base models (five trials per setting) and (ii) matched prompt baselines, namely prompt self-update and RAG plus self-update. Appendix diagnostics add Open-minded/Stubborn profile and topic-grid demonstrations over 10 topics with three debates per profile-pairing cell. When a system does not expose $S$, a shared external LLM judge receives the proposition and generated text, then returns a scalar stance in $[-1, 1]$ at temperature $0.0$. Hyperparameters appear in Tab. [A5](https://arxiv.org/html/2605.15343#A6.T5).
#### Human replay validation\.
DEBATE replay uses the benchmark as observed-evidence replay rather than as one of its original simulation tasks. For each participant, we initialise BE from the private initial Likert stance, replay directed evidence from partner tweets and received chat messages, and compare the final BE stance with the private final Likert stance. Six-point Likert responses are mapped linearly to $S \in [-1, 1]$. We perform a grid search over $(u, a)$ and select settings by held-out RMSE on final stance. We report two references: a no-change baseline that predicts the final stance from the initial stance, and a net-evidence linear baseline, $\hat{S}_{\mathrm{final}} = S_{\mathrm{initial}} + \beta E$, that fits one scalar evidence weight on training folds only. The linear baseline uses the same quality-filtered received-evidence stream, human-like uptake and deduplication filter, and classifier strength scores as BE, but omits the Bayesian log-odds belief-update rule. The paper-facing replay excludes self-authored posts and messages, isolating the belief-update component under the evidence a participant received.
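The two reference baselines and the Likert mapping can be sketched in a few lines of Python. This is an illustration under stated assumptions: Likert responses are taken to be coded 1 to 6, $\beta$ is fitted by ordinary least squares without an intercept (the text says only that one scalar weight is fitted), the net evidence $E$ is supplied by the caller because its exact construction is not given here, and all function names are hypothetical.

```python
import math

def likert_to_stance(x, low=1, high=6):
    """Linear map from a Likert response in [low, high] to stance in [-1, 1]."""
    return 2.0 * (x - low) / (high - low) - 1.0

def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def fit_beta(s_initial, evidence, s_final):
    """Least-squares weight for S_final ~ S_initial + beta * E (no intercept)."""
    numerator = sum(e * (f - i) for i, e, f in zip(s_initial, evidence, s_final))
    denominator = sum(e * e for e in evidence)
    return numerator / denominator if denominator else 0.0

def predict_no_change(s_initial):
    """No-change baseline: the final stance equals the initial stance."""
    return list(s_initial)

def predict_linear(s_initial, evidence, beta):
    """Net-evidence linear baseline with a single scalar weight."""
    return [i + beta * e for i, e in zip(s_initial, evidence)]
```

In a cross-validated replay, `fit_beta` would be called on training folds only and `rmse` evaluated on the held-out fold, matching the protocol described above.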
## 4 Results
The results ask three questions. First, is the maintained stance $S$ visible in generated text strongly enough to support comparison with external stance scores? Second, do uptake and anchoring give predictable control across base models, and do prompt-only self-revision or RAG expose a comparable update trail? Third, when replayed on DEBATE, which final-stance movements are explained by extracted received evidence, and which point beyond it? Generated-agent experiments test controllability and auditability, while DEBATE replay tests how the same update interface captures human response heterogeneity. Appendix [D](https://arxiv.org/html/2605.15343#A4) reports auxiliary diagnostics for the extractor and judgement layer.
### 4\.1Uptake and anchoring control stance dynamics
We first validate that the maintained stance $S$ is expressed in generated language. We sweep $S \in [-1, 1]$ and ask the independent judge used for non-BE baselines to score each generated response. The scores align tightly with the assigned stance ($r = 0.967$, $p < 0.001$), with a fitted slope of $\approx 0.86$ indicating mild compression at the extremes. Because the judge sees only text, it provides a calibrated behavioural score for systems that do not expose $S$.
We next test the two controls directly. We sweep evidence uptake $u$ and prior anchoring $a$ across five levels ($\{0.2, \dots, 1.0\}$) in 15-round debates against a deterministic counter-argument opponent on the proposition “We should introduce compulsory voting”. We repeat the sweep with three base language models (GPT-4o-mini, Qwen 3.5 9B, and Gemma 4 E4B) to test whether the pattern depends on the generator. Figure [2](https://arxiv.org/html/2605.15343#S3.F2) shows the intended ordering. Higher $u$ makes agents more responsive to new evidence: for GPT-4o-mini, the final stance shifts from $+0.81$ ($u = 0.2$) to $-0.49$ ($u = 1.0$), and Qwen and Gemma move even further in the high-uptake setting ($-0.96$ and $-0.99$). Higher $a$ makes the seeded prior more persistent. Holding $u = 0.4$, GPT-4o-mini ranges from $-0.37$ ($a = 0.2$) to $+0.80$ ($a = 1.0$), and Qwen and Gemma follow the same pattern.
Figure [2](https://arxiv.org/html/2605.15343#S3.F2) (right) demonstrates that the same rule can also create recognisable two-agent profiles. Open/Open agents reduce disagreement, Stubborn/Open pulls the open con agent toward the anchored pro agent, and Stubborn/Stubborn preserves more of the initial gap. The endpoints differ by base model and trial because retrieval keeps active evidence from both sides available, so generation still affects the rate and magnitude of movement. This is the desired behaviour for simulation: openness, asymmetric anchoring, and mutual anchoring become explicit settings rather than implicit prompt effects.
Figure 3: Prompt-baseline comparison on 15-round atheism debates. BE variants use fixed $(u, a) = (0.2, 0.4)$ and expose internal stance; prompt self-update and RAG plus self-update use external-judge scores on the same $[-1, 1]$ scale.
### 4\.2Prompt baselines show weaker convergence without update trails
Prompt baselines test whether self-revision instructions and retrieval are enough to produce comparable convergence without a separately maintained state. The matched two-agent debates use 15 rounds, 3 trials, and initial stances $\pm 0.75$ (Fig. [3](https://arxiv.org/html/2605.15343#S4.F3)).
In the BE conditions, memory, retrieval, and the update profile remain fixed at $(u, a) = (0.2, 0.4)$ while the base generator varies. All four BE conditions reduce the initial stance gap, with final gaps of $0.50$ (GPT-4o-mini), $0.23$ (GPT-5.4-mini), $0.22$ (Qwen), and $0.07$ (Gemma), each traceable to active evidence under the shared profile.
The GPT-5.4-mini prompt self-update and RAG plus self-update baselines also move in the external-judge score, but their convergence is weaker. They end with wider Pro/Con gaps than most BE variants: $(+0.23, -0.40)$ and $(+0.23, -0.62)$. BE therefore adds what prompt-level self-revision lacks: an internal belief state, reported uptake and anchoring parameters, and an evidence-level update trail.
### 4\.3Human replay diagnoses which movements are explained by extracted evidence
DEBATE replay fixes the received-evidence stream and fits only the update rule. We use it as a diagnostic heterogeneity test: which uptake–anchoring profile reconstructs the observed final stance? The subgroup partitions are outcome-conditioned diagnostics from observed final movement, so they should be read as replay analyses rather than pre-outcome prediction rules. Figure [4](https://arxiv.org/html/2605.15343#S4.F4) shows the calibration surface. Each heatmap holds out final human stances for one participant subset, sweeps $u$ and $a$, and colours each cell by how much its RMSE exceeds that panel’s best cell. The pooled surface favours low uptake with moderate anchoring. After partitioning by evidence alignment, the minima move to different regions: evidence-aligned movers favour high uptake, evidence-opposed movers favour near-zero uptake, and stable participants favour near-zero uptake with maximal anchoring. The panel therefore shows why a single default profile is misspecified for a mixed population.
Figure 4: DEBATE replay calibration surfaces. Each heatmap sweeps evidence uptake $u$ and prior anchoring $a$ for one participant subset. Cells show held-out RMSE above that panel’s best cell; lower is better. A star marks the optimum, and the printed RMSE is the absolute error at that optimum.

Table 1: Held-out DEBATE replay RMSE on mapped final stance. No-ch. predicts initial stance, Linear fits a train-fold scalar on signed extracted evidence, and BE uses the calibrated uptake–anchoring profile. Gain is Linear minus BE; $|\Delta|$ is mean absolute human movement. Subgroup rows are outcome-conditioned replay analyses, not pre-outcome prediction rules.

Figure 5: Single-participant DEBATE replay examples. Each panel samples one participant from a diagnostic subset and applies that subset’s calibrated $(u, a)$ profile, shown above the panel. The x-axis counts accepted received-evidence updates rather than debate rounds. Starting from the observed human initial stance, the teal curve shows the BE replay after each accepted evidence item. Markers indicate whether the extracted item pushes stance upward or downward. The red dashed line connects only the observed human initial and final stance, since DEBATE does not measure intermediate human stances.

Table [1](https://arxiv.org/html/2605.15343#S4.T1) reports the held-out errors and the mean observed movement. In the pooled population, BE reduces RMSE from $0.489$ under no-change and $0.488$ under net-evidence linear replay to $0.465$. Outcome-conditioned diagnostic gains are larger. Evidence-aligned and evidence-opposed movers have similar movement magnitudes (mean $|\Delta| = 0.619$ and $0.633$), but require different profiles: aligned movers are reconstructed by high uptake ($0.393$ RMSE), while opposed movers flag cases where extracted evidence points against observed movement.
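As a quick check on scale, the pooled gain can be worked out from the rounded RMSE values quoted in the text. The table's own entries may differ in the last digit if they were computed from unrounded errors.

```latex
\begin{align*}
\text{Gain}_{\text{pooled}} &= \mathrm{RMSE}_{\text{Linear}} - \mathrm{RMSE}_{\text{BE}}
  = 0.488 - 0.465 = 0.023, \\
\frac{\text{Gain}_{\text{pooled}}}{\mathrm{RMSE}_{\text{Linear}}} &= \frac{0.023}{0.488} \approx 4.7\%, \\
\frac{\mathrm{RMSE}_{\text{No-ch.}} - \mathrm{RMSE}_{\text{BE}}}{\mathrm{RMSE}_{\text{No-ch.}}}
  &= \frac{0.489 - 0.465}{0.489} \approx 4.9\%.
\end{align*}
```

The pooled improvement is therefore modest in relative terms, which is consistent with the reading that a single global profile leaves most of the heterogeneity unexplained.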
Stable participants are not evidence of predictive improvement over no\-change, since no\-change is exact by definition for this group; instead, they check that BE can represent resistance to extracted evidence without forcing movement\. The same diagnostic profile parameters outperform prior\-only and pooled global replay references in all five folds under both group\- and topic\-held\-out splits \(Appendix [H](https://arxiv.org/html/2605.15343#A8)\), supporting the interpretation that response heterogeneity is not captured by a single global $(u,a)$ setting\.
At the single\-participant level, Fig\. [5](https://arxiv.org/html/2605.15343#S4.F5) provides explanatory traces for the update mechanism: aligned endpoints show cases where accepted evidence accounts for movement, while divergence points to interpretation, social context, or arguments not captured by extraction\. These patterns separate cases that would otherwise look similar in a prompt\-only agent: movement with accepted evidence, stability under evidence pressure, and movement that the extracted evidence cannot explain\.
## 5 Discussion and future work
These experiments suggest that the value of BE is not to make every deliberative trajectory predictable, but to make change attributable\. “Belief” here is an operational evidential state over a proposition, maintained in log\-odds and exposed through scalar stance\. Under this definition, researchers can specify what counts as evidence, how strongly it shifts the belief state, and how strongly initial commitments persist\. AQR diagnostics show reliable directed\-evidence extraction on clean single\-argument inputs \(96\.9% polarity accuracy\), DEBATE replay improves over no\-change and net\-evidence linear references while showing why a single profile is insufficient for a heterogeneous population, and the prompt comparison shows that externally scored stance change does not expose the update process itself\. This operational definition makes BE easier to compare across models, prompts, and replay settings, but it also limits what the framework can claim to model\. A single proposition\-level stance cannot capture every way people deliberate: they may partly agree, trade off values, distrust a speaker, or soften rhetorically without moving on the headline proposition\. These cases do not undermine the belief\-update layer, but they do show where the next modelling boundary lies\. A natural extension is a multidimensional state in which arguments and experiences are routed to different facets, and retrieval brings back the records relevant to each facet when the agent reasons or responds\. In that setting, RAG\-style or agentic memory could help simulate a broader range of human experience and opinion change while preserving the same inspectable update trail\.
The DEBATE replay also clarifies what calibration means for deliberative agents\. We do not expect one global profile to explain everyone, and averaging all participants together can hide behaviours that a simulator may need to represent\. Some people stay close to their initial stance, some move with the extracted evidence they receive, and others move in ways the current evidence stream does not explain\. BE makes these differences explicit rather than smoothing them away: a stable endpoint can be represented as anchoring, evidence\-aligned movement as uptake, and divergence from extracted evidence as a signal that some social or interpretive factor is missing\. Which profile is appropriate depends on the deliberative setting; our point is that these profiles can be separated, inspected, and reported rather than collapsed into a single population average\.
This matters because DEBATE is one deliberative context, not the template for all deliberation\. Its participants encounter partner posts and chat messages in a short online setting, and most do not show mapped Likert movement\. That stability is informative for online opinion\-exposure simulations, but it should not be confused with settings such as citizens’ assemblies, facilitated workshops, or in\-person deliberation, where participants may enter with an explicit norm of listening, compromise, and joint problem solving\. Separating uptake, anchoring, and response profiles lets BE change the assumed deliberative context: a citizens’ assembly simulation may contain more evidence\-responsive or compromise\-seeking agents, or agents whose movement appears first on sub\-issues and value trade\-offs rather than on the headline proposition\.
Finally, evidence assessment remains a modelling choice\. BE makes this choice explicit, but it does not make it value\-free\. LLM extractors, base models, external judges, and the AQR\-trained strength classifier can all reflect training\-data, safety\-tuning, and community\-specific priors\. This matters most for moral and political topics, where argument quality is contested\. For new domains, the judgement layer should be calibrated against domain data and, where appropriate, audited with human or participatory feedback\. The benefit of BE is that such choices are isolated in a reportable layer rather than hidden in prompting or generation\.
The broader implication is that a simulated persona should not be only a role description; it should also specify how the agent carries information across time\. A prompt can say who an agent is, while a memory\-and\-judgement process specifies what the agent treats as experience, which parts of that experience count as evidence, and how much they can revise persistent commitments\. For trustworthy deliberative AI, this traceability matters: convergence, polarisation, and stability should be accountable to visible update assumptions rather than inferred from plausible transcripts alone\.
## 6 Conclusion
In deliberative multi\-agent systems, stance change becomes a modelling problem\. We presented the Belief Engine, an agentic framework that separates evidence extraction, judgement, memory, log\-odds updating, stance computation, and response generation, so that evidence uptake and prior anchoring can be set explicitly and audited through argument records\. Across tested base models, these controls make agents predictably more evidence\-responsive or more anchored\. DEBATE replay shows where BE is most useful: it best reconstructs participants whose final stance follows extracted received evidence\. BE therefore helps make deliberative\-agent stance dynamics traceable to explicit update assumptions\.
## Impact Statement
This work aims to make deliberative AI systems easier to study, compare, and govern\. If LLM agents are used in civic discussion, education, negotiation, or collective decision\-making, designers need to know when agents listen, preserve commitments, converge, or remain in disagreement, and why\. BE contributes to this goal by making evidence uptake, anchoring, and memory inspectable rather than hidden inside prompts\. This could support more transparent simulations, better auditing of agent behaviour, and systems that encourage reflection rather than unexamined prompt effects\. The main risk is over\-interpretation: synthetic trajectories, even when auditable, should not be treated as predictions of human opinion change or as evidence that a deliberative design will work in a real community without domain calibration and participatory validation\. We therefore view BE as infrastructure for transparent experimentation, not a substitute for human judgement or domain\-specific governance\.
## References
- B\. K\. Ashinoff, J\. Buck, M\. Woodford, and G\. Horga \(2022\) The effects of base rate neglect on sequential belief updating and real\-world beliefs\. PLOS Computational Biology 18\(12\), pp\. e1010796\. External Links: ISSN 1553\-7358, [Document](https://dx.doi.org/10.1371/journal.pcbi.1010796)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p4.1)\.
- H\. K\. Choi, X\. Zhu, and S\. Li \(2025a\) Debate or vote: which yields better decisions in multi\-agent large language models? In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p1.1)\.
- J\. Choi, Y\. Hong, M\. Kim, and B\. Kim \(2025b\) Examining identity drift in conversations of LLM agents\. arXiv preprint arXiv:2412\.00804\. Cited by: [§1](https://arxiv.org/html/2605.15343#S1.p2.1), [§2](https://arxiv.org/html/2605.15343#S2.p1.1)\.
- Y\. Chuang, R\. Tu, C\. Dai, S\. Vasani, Y\. Li, B\. Yao, M\. H\. Tessler, S\. Yang, D\. Shah, R\. Hawkins, J\. Hu, and T\. T\. Rogers \(2025\) DEBATE: a large\-scale benchmark for multi\-agent opinion dynamics\. arXiv preprint arXiv:2510\.25110\. Cited by: [Table A11](https://arxiv.org/html/2605.15343#A10.T11.4.2.1.3.1.1), [§1](https://arxiv.org/html/2605.15343#S1.p1.1), [§1](https://arxiv.org/html/2605.15343#S1.p5.2), [§2](https://arxiv.org/html/2605.15343#S2.p1.1)\.
- G\. Deffuant, D\. Neau, F\. Amblard, and G\. Weisbuch \(2000\) Mixing beliefs among interacting agents\. Advances in Complex Systems 3\(01n04\), pp\. 87–98\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p4.1)\.
- M\. H\. DeGroot \(1974\) Reaching a consensus\. Journal of the American Statistical Association 69\(345\), pp\. 118–121\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p4.1)\.
- M\. Flechtner \(2026\) Procedural parity, outcome mismatch: evaluating human vs LLM deliberation\. In Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems, CHI EA ’26, Barcelona, Spain, pp\. 14\. External Links: [Document](https://dx.doi.org/10.1145/3772363.3798499)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p2.1)\.
- S\. Gretz, R\. Friedman, E\. Cohen\-Karlik, A\. Toledo, D\. Lahav, R\. Aharonov, and N\. Slonim \(2019\) A large\-scale dataset for argument quality ranking: construction and analysis\. arXiv preprint arXiv:1911\.11408\. Cited by: [Table A11](https://arxiv.org/html/2605.15343#A10.T11.4.3.2.3.1.1), [Appendix D](https://arxiv.org/html/2605.15343#A4.p1.1), [§3\.1](https://arxiv.org/html/2605.15343#S3.SS1.SSS0.Px2.p1.12), [§3\.2](https://arxiv.org/html/2605.15343#S3.SS2.SSS0.Px1.p1.2)\.
- J\. F\. Gudiño, U\. Grandi, and C\. Hidalgo \(2024\) Large language models \(LLMs\) as agents for augmented democracy\. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 382\(2285\), pp\. 20240100\. External Links: [Document](https://dx.doi.org/10.1098/rsta.2024.0100), [Link](https://doi.org/10.1098/rsta.2024.0100)\. Cited by: [§1](https://arxiv.org/html/2605.15343#S1.p1.1), [§2](https://arxiv.org/html/2605.15343#S2.p2.1)\.
- B\. J\. Gutiérrez, Y\. Shu, Y\. Gu, M\. Yasunaga, and Y\. Su \(2024\) HippoRAG: neurobiologically inspired long\-term memory for large language models\. In ACL\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
- P\. He, J\. Gao, and W\. Chen \(2021\) DeBERTaV3: improving DeBERTa using ELECTRA\-style pre\-training with gradient\-disentangled embedding sharing\. arXiv preprint arXiv:2111\.09543\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2111.09543)\. Cited by: [Table A11](https://arxiv.org/html/2605.15343#A10.T11.4.4.3.3.1.1), [§3\.1](https://arxiv.org/html/2605.15343#S3.SS1.SSS0.Px2.p1.12)\.
- R\. Hegselmann and U\. Krause \(2002\) Opinion dynamics and bounded confidence: models, analysis, and simulation\. Journal of Artificial Societies and Social Simulation 5\(3\)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p4.1)\.
- C\. A\. Holt and A\. M\. Smith \(2009\) An update on Bayesian updating\. Journal of Economic Behavior & Organization 69\(2\), pp\. 125–134\. External Links: ISSN 0167\-2681, [Document](https://dx.doi.org/10.1016/j.jebo.2007.08.013)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p4.1)\.
- T\. Hu, Z\. Tan, S\. Wang, H\. Qu, and T\. Chen \(2025\) Multi\-agent debate for LLM judges with adaptive stability detection\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p2.1)\.
- Z\. Huang, Z\. Tian, Q\. Guo, F\. Zhang, Y\. Zhou, D\. Jiang, and X\. Zhou \(2025\) LiCoMemory: lightweight and cognitive agentic memory for efficient long\-term reasoning\. arXiv preprint arXiv:2511\.01448\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
- J\. Kang, M\. Ji, Z\. Zhao, and T\. Bai \(2025\) Memory OS of AI agent\. arXiv preprint arXiv:2506\.06326\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
- J\. Kim, S\. Lai, N\. Scherrer, B\. Agüera y Arcas, and J\. Evans \(2026\) Reasoning models generate societies of thought\. arXiv preprint arXiv:2601\.10825\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p1.1)\.
- M\. Kim, S\. Kim, and J\. Thorne \(2024\) From evidence to belief: a Bayesian epistemology approach to language models\. arXiv preprint arXiv:2504\.19622\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p1.1)\.
- M\. Li, Z\. Wang, H\. Li, and J\. Liu \(2026\) R\-debater: retrieval\-augmented debate generation through argumentative memory\. In Proceedings of AAMAS 2026\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
- T\. Nakazawa, T\. Tatsumi, Y\. Souma, and S\. Ohnuma \(2024\) An effect of storytelling on attitude changes in deliberative mini\-publics\. Journal of Deliberative Democracy 20\(1\)\. External Links: [Document](https://dx.doi.org/10.16997/jdd.1426)\. Cited by: [§3\.1](https://arxiv.org/html/2605.15343#S3.SS1.SSS0.Px1.p1.7)\.
- C\. Novelli, J\. Argota Sánchez\-Vaquerizo, D\. Helbing, A\. Rotolo, and L\. Floridi \(2025\) Testing deliberative democracy through digital twins\. SSRN preprint\. External Links: [Document](https://dx.doi.org/10.2139/ssrn.5193012)\. Cited by: [§1](https://arxiv.org/html/2605.15343#S1.p1.1), [§2](https://arxiv.org/html/2605.15343#S2.p2.1)\.
- E\. J\. Olsson \(2013\) A Bayesian Simulation Model of Group Deliberation and Polarization\. In Bayesian Argumentation, F\. Zenker \(Ed\.\), Vol\. 362, pp\. 113–133\. External Links: [Document](https://dx.doi.org/10.1007/978-94-007-5357-0%5F6), ISBN 978\-94\-007\-5356\-3 978\-94\-007\-5357\-0\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p4.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp\. 1–22\. Cited by: [§1](https://arxiv.org/html/2605.15343#S1.p1.1), [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
- J\. Piao, Y\. Yan, J\. Zhang, N\. Li, J\. Yan, X\. Lan, Z\. Lu, Z\. Zheng, J\. Y\. Wang, D\. Zhou, C\. Gao, F\. Xu, F\. Zhang, K\. Rong, J\. Su, and Y\. Li \(2026\) AgentSociety: large\-scale simulation of LLM\-driven generative agents advances understanding of human behaviors and society\. arXiv preprint arXiv:2502\.08691\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
- S\. Pradeep Fulay, P\. Ravi, O\. Gokhale, E\. Yi, M\. A\. Bakker, and D\. Roy \(2026\) Agora: teaching the skill of consensus\-finding with AI personas grounded in human voice\. In Proceedings of the Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems, CHI EA ’26, pp\. 1–6\. External Links: [Link](http://dx.doi.org/10.1145/3772363.3798888), [Document](https://dx.doi.org/10.1145/3772363.3798888)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p2.1)\.
- S\. Rahman, S\. Issaka, A\. Suvarna, G\. Liu, J\. Shiffer, J\. Lee, M\. R\. Parvez, H\. Palangi, S\. Feng, N\. Peng, Y\. Choi, J\. Michael, L\. Jiang, and S\. Gabriel \(2025\) AI debate aids assessment of controversial claims\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p2.1)\.
- S\. Shekkizhar, R\. Cosentino, A\. Earle, and S\. Savarese \(2025\) Echoing: identity failures when LLM agents talk to each other\. arXiv preprint arXiv:2511\.09710\. Cited by: [§1](https://arxiv.org/html/2605.15343#S1.p2.1), [§2](https://arxiv.org/html/2605.15343#S2.p1.1)\.
- E\. Stengård, P\. Juslin, U\. Hahn, and R\. van den Berg \(2022\) On the generality and cognitive basis of base\-rate neglect\. Cognition 226, pp\. 105160\. External Links: ISSN 0010\-0277, [Document](https://dx.doi.org/10.1016/j.cognition.2022.105160)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p4.1)\.
- Z\. Tan, J\. Yan, I\. Hsu, R\. Han, Z\. Wang, L\. Le, Y\. Song, Y\. Chen, H\. Palangi, G\. Lee, A\. R\. Iyer, T\. Chen, H\. Liu, C\. Lee, and T\. Pfister \(2025\) In prospect and retrospect: reflective memory management for long\-term personalized dialogue agents\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 8416–8439\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.413)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
- A\. Taubenfeld, Y\. Dover, R\. Reichart, and A\. Goldstein \(2024\) Systematic Biases in LLM Simulations of Debates\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 251–267\. External Links: 2402\.04049, [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.16)\. Cited by: [§1](https://arxiv.org/html/2605.15343#S1.p2.1), [§2](https://arxiv.org/html/2605.15343#S2.p1.1)\.
- C\. Wang, F\. Forte, L\. Wettstein, P\. Dillenbourg, and T\. Wambsganss \(2026\) ArgueMate: designing an arguing agent with maximised disagreement to support student peer\-argumentation exercise\. In Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems, CHI EA ’26, Barcelona, Spain, pp\. 1–9\. External Links: [Document](https://dx.doi.org/10.1145/3772363.3799300)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p2.1)\.
- Y\. Xie, Z\. Zhou, C\. Cao, Q\. Niu, T\. Liu, and B\. Han \(2025\) From debate to equilibrium: belief\-driven multi\-agent LLM reasoning via Bayesian Nash equilibrium\. In Proceedings of ICML 2025\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p4.1)\.
- Z\. Xiong, Y\. Lin, W\. Xie, P\. He, Z\. Liu, J\. Tang, H\. Lakkaraju, and Z\. Xiang \(2025\) How memory management impacts LLM agents: an empirical study of experience\-following behavior\. arXiv preprint arXiv:2505\.16067\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1), [§3\.1](https://arxiv.org/html/2605.15343#S3.SS1.SSS0.Px2.p1.12)\.
- R\. Xu, Z\. Qi, Z\. Guo, C\. Wang, H\. Wang, Y\. Zhang, and W\. Xu \(2024\) Knowledge conflicts for LLMs: a survey\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 8541–8565\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.486)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p1.1)\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\) A\-mem: agentic memory for LLM agents\. arXiv preprint arXiv:2502\.12110\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
- Y\. Xu, S\. Zhang, Y\. Zhou, S\. Zeng, L\. V\. S\. Lakshmanan, and C\. Ma \(2026\) Topology\-aware LLM\-driven social simulation: a unified framework for efficient and realistic agent dynamics\. arXiv preprint arXiv:2604\.18011\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
- J\. C\. Yang, D\. Dailisan, M\. Korecki, C\. I\. Hausladen, and D\. Helbing \(2024\) LLM voting: human choices and AI collective decision\-making\. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 7\(1\), pp\. 1696–1708\. External Links: [Document](https://dx.doi.org/10.1609/aies.v7i1.31758)\. Cited by: [§1](https://arxiv.org/html/2605.15343#S1.p1.1)\.
- I\. M\. Young \(2002\) Inclusion and democracy\. Oxford University Press\. External Links: [Document](https://dx.doi.org/10.1093/0198297556.001.0001), ISBN 9780198297550\. Cited by: [§3\.1](https://arxiv.org/html/2605.15343#S3.SS1.SSS0.Px1.p1.7)\.
- J\. Zhang, J\. Yang, and K\. Wang \(2025\) Large language models as discounted Bayesian filters\. arXiv preprint arXiv:2512\.18489\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p1.1)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2023\) MemoryBank: enhancing large language models with long\-term memory\. arXiv preprint arXiv:2305\.10250\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2305.10250)\. Cited by: [§2](https://arxiv.org/html/2605.15343#S2.p3.1)\.
## Appendix A Design Rationale Details
Table A1: Design rationale for the Belief Engine\. Each component gives researchers a separate handle on belief\-state change in deliberative\-agent simulations\.
## Appendix B Conflict\-Resolution Details
The main text describes conflict resolution at a high level\. Here we give the exact rule used to decide whether a candidate argument remains active\. Let $\phi(c)$ be the embedding of claim $c$ and define cosine similarity as
$$\operatorname{sim}(i,j)=\frac{\phi(c_{i})^{\top}\phi(c_{j})}{\|\phi(c_{i})\|\,\|\phi(c_{j})\|}.$$
For a new argument $e_{i}$, if no active same\-polarity record exists, we set $z_{i}=1$\. Otherwise, let
$$j^{\star}=\arg\max_{j:\,z_{j}=1,\;p_{j}=p_{i}}\operatorname{sim}(i,j).$$
If $\operatorname{sim}(i,j^{\star})<\theta$, where $\theta$ is set by the corresponding experiment configuration, the new argument remains active\. If the similarity exceeds the threshold, only the stronger of the two near\-duplicate records remains active:
$$(z_{i},z_{j^{\star}})=\begin{cases}(1,0),&s_{i}>s_{j^{\star}},\\ (0,1),&s_{i}\leq s_{j^{\star}}.\end{cases}$$
The archived record remains stored, but it is excluded from the active set used for belief updating and retrieval\. Generated\-agent thresholds are reported in Tab\. [A5](https://arxiv.org/html/2605.15343#A6.T5); DEBATE replay uses $\theta=0.85$\.
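The rule above can be sketched as follows\. This is an illustrative sketch under stated assumptions rather than the released code: the record fields and function names are hypothetical, embeddings are assumed to be non\-zero vectors, and a similarity exactly equal to $\theta$ is treated as a near\-duplicate, a boundary case the text leaves open\.

```python
import math

def cosine(x, y):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def add_argument(new, records, theta=0.85):
    """Apply the near-duplicate rule, then store `new` in `records`.

    Each record is a dict with hypothetical keys: 'emb' (embedding),
    'polarity' (+1 or -1), 'strength' (float), 'active' (bool).
    Returns True if the new argument ends up active.
    """
    rivals = [r for r in records if r["active"] and r["polarity"] == new["polarity"]]
    new["active"] = True  # default: no active same-polarity record, or not a duplicate
    if rivals:
        nearest = max(rivals, key=lambda r: cosine(new["emb"], r["emb"]))
        # Assumption: similarity equal to theta counts as a near-duplicate.
        if cosine(new["emb"], nearest["emb"]) >= theta:
            if new["strength"] > nearest["strength"]:
                nearest["active"] = False  # archive the weaker existing record
            else:
                new["active"] = False      # ties keep the existing record
    records.append(new)  # archived records stay stored, just inactive
    return new["active"]

def arg(emb, polarity, strength):
    """Build a record dict (hypothetical helper for the example)."""
    return {"emb": emb, "polarity": polarity, "strength": strength}

memory = []
first = arg([1.0, 0.0], +1, 0.5)
stronger_dup = arg([1.0, 0.01], +1, 0.8)
results = [add_argument(first, memory), add_argument(stronger_dup, memory)]
```

In this example the second argument is a stronger near\-duplicate of the first, so the first is archived while both remain in `memory`\.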
## Appendix C Five\-Topic Prompt\-Based Check
The five\-topic mechanism check broadens the prompt\-based comparison beyond the main atheism example \(Tab\. [A2](https://arxiv.org/html/2605.15343#A3.T2)\)\. Each run pairs a Pro agent and a Con agent for 15 rounds, with fixed initial stance anchors $+0.75$ and $-0.75$, respectively\. We run three independent trials per topic–setup pair and summarise the final stance dynamics across the five topics: social media, journalism, compulsory voting, atheism, and libertarianism\.
The four Belief Engine rows use the same memory, retrieval, and update profile across topics, with evidence uptake $u=0.2$ and prior anchoring $a=0.4$; they differ only in the base language model generating debate turns\. The two non\-BE baselines use GPT\-5\.4\-mini\. In prompt self\-update, agents have no retrieval memory or Belief Engine and are instead prompted to update their stance after opponent turns\. In RAG plus self\-update, agents receive retrieved argument memory and prompt\-level self\-updating, but still do not use the Bayesian Belief Engine update\.
The five\-topic check keeps the same measurement caveat as Fig\. [3](https://arxiv.org/html/2605.15343#S4.F3): BE rows report internal stance, whereas baseline rows report external judge scores because those systems do not expose $S$\. The external judge is a calibrated proxy for expressed stance, not a substitute for an auditable internal trajectory\.
Table A2: Five\-topic prompt\-based mechanism check over 15\-round Pro/Con debates\. Each row averages three trials for one topic and setup\. BE variants share the same retrieval and update profile, with $u=0.2$ and $a=0.4$, and differ only in the base debate model\. Prompt self\-update and RAG plus self\-update are GPT\-5\.4\-mini non\-BE baselines\. BE rows use internal stance; prompt\-based baselines use external judge scores\.
As an illustrative measurement\-controlled check, Fig\. [A1](https://arxiv.org/html/2605.15343#A3.F1) rescores the generated journalism debates with the same external stance judge for all setups\. This removes the mixed\-measurement caveat in Tab\. [A2](https://arxiv.org/html/2605.15343#A3.T2): BE conditions are no longer shown with internal stance while prompt baselines are shown with external scores\. The figure should be read qualitatively, since it covers one topic with three trials per setup, but it supports the same pattern: most BE variants move both sides closer to the centre under the external judge, while prompt self\-update and RAG plus self\-update leave a larger final gap\.
Figure A1: External\-judge trajectory check on the journalism topic, “We should subsidize journalism\.” All panels use the same judged\-text stance metric, from $-1$ \(strongly con\) to $+1$ \(strongly pro\)\. Thick lines show mean judged stance across three trials for Pro\-side and Con\-side agents; faint lines show individual trials and bands show across\-trial variation\. This figure is an illustrative comparability check rather than a new statistical result\.
## Appendix D Judgement Layer Validation Details
We evaluate the judgement layer on a 100\-row sample from the Argument Quality Ranking \(AQR\) dataset \(Gretz et al\., [2019](https://arxiv.org/html/2605.15343#bib.bib38)\)\. AQR contains clean single arguments with crowd\-derived pro/con stance labels and crowd\-rated argument quality scores, but it does not label how much a human stance should move after reading each argument\. We therefore use AQR as a direct test of directed evidence extraction, and only an indirect test of update magnitude\.
The LLM judgement layer recovered the human pro/con stance of the strongest extracted claim with 96\.9% accuracy and 0\.969 macro F1, supporting its use as a directed\-evidence extractor\. Strength calibration is less automatic: raw LLM strength aligns weakly with AQR quality \($r=0.089$\), while the task\-specific local quality classifier aligns substantially better with human ratings \($r=0.72$\)\.
Table A3: Detailed judgement\-layer diagnostics on a 100\-row AQR sample\. AQR directly tests polarity recovery, but tests strength only as alignment with crowd\-rated speech quality, not belief\-update pressure\.
## Appendix E DEBATE Replay Protocol
DEBATE contains 2,792 U\.S\.\-based participants in four\-person discussions over 107 controversial topics: participants privately report an initial opinion and justification, exchange round\-wise tweet\-like posts and dyadic chat messages, and privately report a final opinion\. Our replay uses 2,495 quality\-filtered participant trajectories with usable pre/post stance reports and processable directed\-exposure histories\. Six\-point Likert responses are mapped linearly to the Belief Engine stance scale $S\in[-1,1]$\. For each participant, the mapped initial Likert stance initialises the prior\. Prior anchoring $a$ scales that initial log\-odds state, and arguments extracted from directed partner tweets and received chat messages are replayed chronologically under evidence uptake $u$\. The replay does not generate new utterances or use retrieval\-conditioned response generation, so it isolates Eq\. [1](https://arxiv.org/html/2605.15343#S3.E1) under a fixed received\-evidence stream\.
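As a rough sketch of the replay loop, the code below assumes Likert responses are coded 1 to 6 and assumes a simple additive log\-odds form in which the anchored prior is $a$ times the initial log\-odds and each accepted item adds $u\,p_{i}s_{i}$\. Eq\. 1 is not reproduced in this section, so this update form, the clipping constant `eps`, and all function names are assumptions for illustration and may differ from the paper’s rule\.

```python
import math

def likert_to_stance(k, points=6):
    """Linearly map a Likert response k in 1..points onto [-1, 1]."""
    return -1.0 + 2.0 * (k - 1) / (points - 1)

def stance_to_logodds(s, eps=1e-3):
    """Convert stance in [-1, 1] to log-odds; eps clipping (assumed) keeps endpoints finite."""
    p = min(max((s + 1.0) / 2.0, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def logodds_to_stance(l):
    """Convert log-odds back to stance in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-l)) - 1.0

def replay(initial_likert, evidence, u, a):
    """Replay accepted evidence chronologically under an ASSUMED additive rule.

    evidence: list of (polarity, strength) pairs, polarity in {+1, -1}.
    Returns the stance after the anchored prior and after each item.
    """
    l = a * stance_to_logodds(likert_to_stance(initial_likert))  # anchored prior
    trajectory = [logodds_to_stance(l)]
    for polarity, strength in evidence:
        l += u * polarity * strength  # uptake-scaled signed evidence
        trajectory.append(logodds_to_stance(l))
    return trajectory

# Hypothetical participant: starts at Likert 5, receives one pro and two con items.
trajectory = replay(initial_likert=5, evidence=[(+1, 0.6), (-1, 0.9), (-1, 0.4)], u=0.2, a=0.4)
```

Under this assumed form, setting $u=0$ reproduces a flat trajectory, which is one way to see how near\-zero uptake can represent stable participants\.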
The main replay uses directed received exposure, human\-like uptake and vector deduplication, classifier\-based argument strength, five held\-out group folds, and seed 42; the robustness appendix repeats the check with topic\-held\-out folds\. For each fold, $(u,a)$ is selected on training groups from the grid below and evaluated on held\-out groups by RMSE against the participant’s mapped final stance\.
The net\-evidence linear baseline is fit on the same group\-held\-out replay setup\. For participant $j$ in fold $f$, it predicts
$$\hat{S}_{j,\mathrm{final}}=S_{j,\mathrm{initial}}+\beta_{f}E_{j},$$
where $E_{j}=\sum_{i\in\mathcal{R}_{j}}p_{i}s_{i}$ is the signed net evidence over accepted received records $\mathcal{R}_{j}$, after the same quality filtering, human\-like uptake/deduplication filter, and classifier\-strength scoring used by BE\. We fit one scalar coefficient $\beta_{f}$ on training folds only, then evaluate on the held\-out fold\. The five fitted coefficients were small: 0\.0083, 0\.0105, 0\.0080, 0\.0089, and 0\.0082\.
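A minimal sketch of this baseline follows, assuming the scalar is fit by ordinary least squares through the origin on observed movement \(the text states only that one scalar is fit on training folds\)\. Function names and the toy data are hypothetical\.

```python
def net_evidence(records):
    """Signed net evidence E_j: sum of polarity * strength over accepted records."""
    return sum(p * s for p, s in records)

def fit_beta(train):
    """Least-squares slope through the origin of (S_final - S_initial) on E.

    train: list of (s_initial, s_final, E) tuples from training folds only.
    Assumption: the paper does not state the fitting criterion.
    """
    num = sum(E * (s_final - s_initial) for s_initial, s_final, E in train)
    den = sum(E * E for _, _, E in train)
    return num / den if den > 0 else 0.0

def predict_final(s_initial, E, beta):
    """Baseline prediction: initial stance plus beta times net evidence."""
    return s_initial + beta * E

# Toy training fold: (initial stance, final stance, net evidence).
train = [(0.0, 0.1, 10.0), (0.2, 0.0, -20.0)]
beta = fit_beta(train)
held_out_prediction = predict_final(0.2, net_evidence([(+1, 3.0), (+1, 2.0)]), beta)
```

With a coefficient near 0\.01, even a net evidence of several units moves the prediction only slightly, which is consistent with the baseline staying close to no\-change\.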
Table A4: Evidence\-alignment grouping used for DEBATE replay calibration\. These are outcome\-conditioned diagnostic groups used to analyse replay behaviour\.
## Appendix F Experimental Hyperparameters
Table A5: Hyperparameters and scoring settings for the generated\-agent experiments\. The Belief Engine controls are evidence uptake $u$ and prior anchoring $a$; values are fixed unless a row explicitly states a sweep or profile\-specific setting\.
## Appendix G Evidence\-Alignment Baseline RMSE
Table A6: Held\-out RMSE against no\-change and net\-evidence linear baselines, with mean absolute mapped pre/post human stance change\. Gain is net\-evidence linear RMSE minus BE RMSE\. Subgroup rows are post\-outcome diagnostic groups defined using observed final movement; they evaluate explanatory fit, not pre\-outcome profile prediction\. The 174 weak\-signal movers are included in All participants but omitted from the three\-profile summary\.
## Appendix H Fold\-Level Robustness of DEBATE Replay
As a fold\-level robustness check, we compute one held\-out RMSE per outer fold for the group\-held\-out and topic\-held\-out DEBATE splits\. Confidence intervals are $t$ intervals over five folds and should be read as robustness summaries rather than participant\-level uncertainty estimates\.
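For reference, a $t$ interval over five fold\-level values can be computed as below\. The sketch assumes a two\-sided 95% level, which the text does not state, and uses the standard critical value $t_{0.975,4}\approx 2.776$; the fold values are made up for illustration\.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (five folds).
T_CRIT_95_DF4 = 2.776

def fold_t_interval(fold_values, t_crit=T_CRIT_95_DF4):
    """Return (lower, mean, upper) for a t interval over fold-level values.

    The default critical value is only valid for exactly five folds.
    """
    n = len(fold_values)
    mean = statistics.mean(fold_values)
    half_width = t_crit * statistics.stdev(fold_values) / math.sqrt(n)
    return mean - half_width, mean, mean + half_width

# Hypothetical per-fold held-out RMSE values.
lower, mean, upper = fold_t_interval([0.40, 0.42, 0.44, 0.46, 0.48])
```

With only five values the interval is wide relative to the fold\-to\-fold spread, which is why it is better read as a stability summary than as a participant\-level uncertainty estimate\.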
Table A7: Fold\-level robustness of evidence\-alignment parameters\. Negative deltas mean lower RMSE for evidence\-alignment parameters than the reference model\.
Evidence\-alignment parameters have lower RMSE than both references in all five folds for both splits\. The diagnostic preset obtains lower aggregate RMSE; because it is outcome\-conditioned, we report it as replay analysis rather than a pre\-outcome calibration rule\.
## Appendix I Additional Two\-Agent Topic Grids
The remaining topic\-level stance grids from the generated two\-agent topic run appear in Figs\. [A2](https://arxiv.org/html/2605.15343#A9.F2)–[A5](https://arxiv.org/html/2605.15343#A9.F5); Table [A8](https://arxiv.org/html/2605.15343#A9.T8) reports the corresponding topic\-level convergence values\.
Table A8: Mean convergence in the ten\-topic generated two\-agent topic run\. Convergence is the reduction in pro–con stance distance from the beginning to the end of the debate, averaged over three runs per topic–pairing cell\.
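The convergence measure in the caption can be sketched as follows, assuming that “beginning” and “end” refer to the first and last recorded stance of each agent; the function names and trajectories are hypothetical\.

```python
def convergence(pro_trajectory, con_trajectory):
    """Reduction in pro-con stance distance from the first to the last recorded stance."""
    start_gap = abs(pro_trajectory[0] - con_trajectory[0])
    end_gap = abs(pro_trajectory[-1] - con_trajectory[-1])
    return start_gap - end_gap

def mean_convergence(runs):
    """Average convergence over runs; each run is a (pro_trajectory, con_trajectory) pair."""
    return sum(convergence(pro, con) for pro, con in runs) / len(runs)

# Three hypothetical runs for one topic-pairing cell, starting from +0.75 and -0.75.
runs = [
    ([0.75, 0.50, 0.30], [-0.75, -0.40, -0.10]),
    ([0.75, 0.70, 0.65], [-0.75, -0.70, -0.65]),
    ([0.75, 0.75, 0.75], [-0.75, -0.75, -0.75]),
]
cell_value = mean_convergence(runs)
```

A positive value means the two agents ended closer together than they started; zero means the gap was unchanged, and a negative value would indicate divergence\.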

Figure A2: Additional generated two\-agent topic grids\. Top: Social media brings more harm than good\. Bottom: Entrapment should be legalized\.

Figure A3: Additional generated two\-agent topic grids\. Top: compulsory voting\. Bottom: austerity\.

Figure A4: Additional generated two\-agent topic grids\. Top: We should legalize sex selection\. Bottom: We should adopt atheism\.

Figure A5: Additional generated two\-agent topic grids\. Top: We should subsidize journalism\. Bottom: We should adopt a zero\-tolerance policy in schools\.
## Appendix J Model Identifiers, Prompt Families, and Release Plan
This appendix summarises the LLM\-facing implementation details needed to interpret the experiments\. We report model roles and prompt families here; the anonymous reproducibility artefact is available at [https://anonymous\.4open\.science/r/belief\-engine\-684F](https://anonymous.4open.science/r/belief-engine-684F) and contains the executable configs, exact prompt strings, compact result artefacts, and validation metadata\.
### J\.1 Model identifiers
Table A9: Model identifiers and evaluator roles used in the reported experiments\. Exact resolved configs are included in the anonymous release\.
### J\.2 Prompt families and call structure
Runtime values are filled from the resolved config, retrieved memory, and transcript\. The experiments use the prompt families in Table [A10](https://arxiv.org/html/2605.15343#A10.T10); the release artefact includes the exact byte\-level strings so the manuscript and executable code do not drift\.
Table A10: Prompt families used by the LLM\-facing components\.
### J\.3 Existing assets and licences
Table [A11](https://arxiv.org/html/2605.15343#A10.T11) lists the external datasets, model families, and hosted services used in the reported experiments\. We cite datasets and model families where they are introduced in the main text\. The DEBATE source is [https://huggingface\.co/datasets/seantw/DEBATE\_LLM](https://huggingface.co/datasets/seantw/DEBATE_LLM); the other Hugging Face entries are repository identifiers\. The anonymous artefact does not redistribute third\-party datasets, hosted\-model weights, or raw proprietary API responses; it contains compact derived artefacts needed to verify the reported tables and figures\.
Table A11: Existing assets used in the reported experiments, with source, licence or access terms, and redistribution handling\.
### J\.4 Code and data artefact
The anonymised release at [https://anonymous\.4open\.science/r/belief\-engine\-684F](https://anonymous.4open.science/r/belief-engine-684F) contains a reviewer\-facing reproducibility entry point, source code for the agents, memory, update rules, and evaluators, resolved configs, exact prompt strings, compact derived artefacts for the reported tables and figures, and checksum validation\. The artefact README includes smoke\-test and rebuild instructions, dataset\-access notes for the licensed datasets in Table [A11](https://arxiv.org/html/2605.15343#A10.T11), and a figure/table\-to\-artefact map\. For hosted\-model components, preserved artefacts are the primary reproducibility target; live reruns are treated as consistency checks rather than byte\-identical executions\.