Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

arXiv cs.AI 06/02/26, 04:00 AM Papers

faithfulness-gap llm-agents reasoning chain-of-thought texas-holdem social-simulation process-fidelity

Summary

This paper decomposes the faithfulness gap in LLM agents into reasoning→conclusion and conclusion→action steps using Texas Hold'em poker as a controlled environment. It finds that the conclusion→action step is reliable, while the reasoning→conclusion step is the primary source of inconsistency.

arXiv:2606.00476v1 Announce Type: new Abstract: Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists. We study it in acontrolled setting, a Texas Poker simulator with a verifiable reference action for every decision by decomposing the faithfulness gap into two steps: reasoning-conclusion and conclusion-action. The two steps behave oppositely.

Original Article

View Cached Full Text

Cached at: 06/02/26, 03:47 PM

# Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents
Source: [https://arxiv.org/html/2606.00476](https://arxiv.org/html/2606.00476)
###### Abstract

Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists\. We study it in a controlled setting—a Texas Hold’em simulator with a verifiable reference action for every decision—by decomposing the faithfulness gap into two steps: reasoning→\\rightarrowconclusion \(does the stated decision follow from the agent’s own reasoning?\) and conclusion→\\rightarrowaction \(does the agent execute what it states?\)\. The two steps behave oppositely\. Conclusion→\\rightarrowaction is reliable: inconsistency is 0\.0–1\.8% across three model families, including a natively trained reasoning model, and the 22–26% reported by free\-text conclusion extraction is largely a measurement artifact\. The real gap is upstream—in 65% of erroneous decisions the agent estimates the inputs correctly and restates the rule, then draws a conclusion that contradicts it, and injecting the rule into the prompt does not help\. Process\-fidelity evaluation should therefore target the reasoning→\\rightarrowconclusion step and elicit machine\-checkable conclusions to avoid conflating measurement noise with model behavior\. We release code and a live multi\-agent demo\.

## 1Introduction

A fundamental concern in LLM\-based social simulation is whether agents genuinely act on the reasoning they appear to exhibit, or produce surface\-level rationalizations that do not govern behavior\(Parket al\.,[2023](https://arxiv.org/html/2606.00476#bib.bib3); Zhouet al\.,[2024](https://arxiv.org/html/2606.00476#bib.bib4)\)\. This concern is sharpened by a growing critique that the field deploys simulators faster than it validates them, and that reported results depend heavily on the elicitation protocol and measurement instrument used\(Puelma Touzelet al\.,[2026](https://arxiv.org/html/2606.00476#bib.bib8)\)\. The question is difficult to resolve in open social settings because ground truth is absent: no objective criterion exists for the correct action in a negotiation or a cultural exchange\.

Competitive games with a computable reference policy offer a different regime\. In Texas Hold’em poker, a reasonable reference action at each decision point can be derived from hand equity and pot odds—a fixed reference that is unavailable in most social simulation contexts\. This property renders poker a controlled calibration environment for*process fidelity*: when an LLM agent claims to reason about pot odds and hand strength, the executed action, the stated conclusion, and the intermediate reasoning can each be inspected against one another and against the reference\. Open\-ended social simulations—marketplaces, negotiations, multi\-agent societies—lack this verifiability, yet the agents that populate them are the same models, prompted in the same way\. A failure mode that is measurable in poker is therefore a candidate mechanism for the harder\-to\-measure breakdowns of persona consistency and emergent behavior reported in richer settings such as SOTOPIA\(Zhouet al\.,[2024](https://arxiv.org/html/2606.00476#bib.bib4)\)\.

Using this setup, we decompose the “faithfulness gap” into two distinct steps and find that they behave very differently\. The first step,*conclusion→\\rightarrowaction*, asks whether the agent executes the decision it states\. The second step,*reasoning→\\rightarrowconclusion*, asks whether the stated decision follows validly from the agent’s own intermediate reasoning\. Prior work has reported that CoT explanations are frequently unfaithful\(Turpinet al\.,[2023](https://arxiv.org/html/2606.00476#bib.bib2)\)and that LLMs exhibit a “knowing\-doing gap” in game\-theoretic settings\(Linet al\.,[2026](https://arxiv.org/html/2606.00476#bib.bib6)\); our results refine this picture by showing that the gap is concentrated almost entirely in the second step\.

We study*process fidelity*in LLM agents using poker as a controlled strategic environment with observable intermediate reasoning and measurable reference decisions\. Across three LLM families \(Claude Haiku 4\.5, Gemini 2\.5 Flash\-Lite, and DeepSeek\-v4\-pro—the last a natively trained reasoning model\) and four prompt strategies, we address two research questions:

- •RQ1: How large is the conclusion→\\rightarrowaction inconsistency rate under game\-theoretic pressure, and how sensitive is the measured rate to how the conclusion is elicited and parsed?
- •RQ2: When the reasoning is made explicit, do agents derive conclusions that follow from their own stated inputs and rules; and does injecting the decision rule into the prompt improve adherence to it?

#### Contributions\.

- •A measurement caution for CoT faithfulness\.The measured stated\-vs\.\-actual inconsistency rate is an order of magnitude lower under explicit\-conclusion elicitation \(below 2%\) than under free\-text phrase extraction \(22–26%\), indicating that a substantial part of reported CoT “unfaithfulness” can reflect conclusion\-extraction noise rather than model behavior\.
- •A decomposition that localizes the gap\.Separating reasoning→\\rightarrowconclusion from conclusion→\\rightarrowaction shows the faithfulness gap is concentrated upstream: agents execute their stated decisions almost perfectly, but derive those decisions invalidly from their own stated reasoning\.
- •A characterization of the upstream failure\.The dominant error \(65% of failures\) is rule*misapplication*: agents estimate inputs correctly and restate the rule, then override it with qualitative hedging in a systematically risk\-averse direction\. Injecting the rule into the prompt does not help, isolating rule*application*—not retrieval—as the bottleneck\.
- •A portable process\-fidelity instrument\.We package poker as a verifiable, reproducible testbed with a step\-level fidelity probe \(open harness and live multi\-agent demo\) that can be ported to richer social simulations to detect persona drift and surface\-level rationalization before they propagate\.

![Refer to caption](https://arxiv.org/html/2606.00476v1/x1.png)Figure 1:Overview of the experimental framework\.\(A\)Four LLM agents occupy distinct strategy seats at a Texas Hold’em table; an equity\-based threshold heuristic provides a reproducible reference action for every decision\.\(B\)The decision pipeline decomposes into two steps: reasoning→\\rightarrowconclusion \(does the stated decision follow from the agent’s own inputs and rule?\) and conclusion→\\rightarrowaction \(does the executed action match the stated decision?\)\.\(C\)The measured inconsistency lives almost entirely in the upstream step: conclusion→\\rightarrowaction inconsistency is near zero once the conclusion is elicited explicitly, while reasoning→\\rightarrowconclusion errors are frequent\.The answers bear directly on the central concern of this workshop: distinguishing genuine strategic behavior from model artifacts\. We find that apparent stated\-vs\.\-actual unfaithfulness is, in this setting, sensitive to how the conclusion is measured, while a less commonly measured failure—invalid reasoning that the agent nonetheless executes faithfully— is both real and substantial\.

## 2Related Work

#### Chain\-of\-thought faithfulness\.

Weiet al\.\([2022](https://arxiv.org/html/2606.00476#bib.bib1)\)demonstrated that chain\-of\-thought prompting substantially improves LLM performance on multi\-step reasoning tasks\.Turpinet al\.\([2023](https://arxiv.org/html/2606.00476#bib.bib2)\)showed that CoT explanations can be unfaithful— models influenced by spurious biasing features fail to mention those influences while still acting on them—a finding established through careful controlled manipulation rather than free\-text conclusion extraction\. Our results complement this line of work with a methodological observation specific to our setting: when the stated conclusion is*inferred*from free\-text reasoning rather than controlled directly, the measured stated\-vs\.\-actual inconsistency rate is highly sensitive to the extraction method, varying by an order of magnitude between a phrase parser and an explicit decision tag\. This concerns how the gap is operationalized, not whether CoT can be unfaithful\.

#### LLM multi\-agent social simulation\.

Parket al\.\([2023](https://arxiv.org/html/2606.00476#bib.bib3)\)showed that LLM agents equipped with memory, reflection, and planning can produce emergent social behaviors in sandbox environments\. SOTOPIA\(Zhouet al\.,[2024](https://arxiv.org/html/2606.00476#bib.bib4)\)introduced an interactive benchmark for social intelligence, finding that GPT\-4 achieves significantly lower goal\-completion rates than humans on challenging social scenarios\.Puelma Touzelet al\.\([2026](https://arxiv.org/html/2606.00476#bib.bib8)\)survey this fast\-growing area and argue that it must pivot from expansion to consolidation around reproducible, validated evaluation—calling for*operational validity*, the property that a simulator reproduces rather than merely resembles the target phenomenon\. These works largely evaluate*outcome*quality, or argue for validation in the abstract\. We complement them with a concrete, verifiable*process*\-fidelity measurement at the level of individual reasoning steps\.

#### Strategic reasoning and the knowing\-doing gap\.

GameBench\(Costarelliet al\.,[2024](https://arxiv.org/html/2606.00476#bib.bib5)\)evaluated LLMs across nine strategic game environments, finding that none matched human performance\.Linet al\.\([2026](https://arxiv.org/html/2606.00476#bib.bib6)\)examined LLMs in Texas Hold’em specifically, identifying a “knowing\-doing gap” between reasoning traces and decisions, and addressed it via external solver tools\.Xieet al\.\([2026](https://arxiv.org/html/2606.00476#bib.bib7)\)introduced M3\-BENCH and observed an “overthink\-undercommunicate” pattern\. Our work localizes the knowing\-doing gap: it is not that agents fail to execute their stated decisions, but that they fail to derive correct conclusions from reasoning they themselves produce\.

## 3System and Methodology

### 3\.1The Agent Harness

We construct a Texas Hold’em simulator in which every seat is occupied by an LLM agent\. The harness comprises three components\. Thearena\(engine/game\.py\) manages shared game state—deck, community cards, pot, stacks, and betting order—advancing after each agent action\. Theagent interface\(ai/llm\_player\.py\) exposes a singledecide\(game\)→\\rightarrowactionmethod\. Theobserver layer\(experiment/\) logs every decision alongside a reference action from an equity\-based threshold heuristic, enabling process\-level evaluation independent of chip outcomes\. Stacks are reset to 10,000 chips \(1,000 big blinds\) at the start of each hand, rendering each hand an independent observation and ensuring equal participation across all four strategy seats\.

### 3\.2Prompt Strategies

Each seat is assigned one of four system prompt strategies, held constant throughout the experiment\. Every strategy receives the same per\-decision user prompt describing the game state \(FigureLABEL:fig:prompts, bottom\) and differs only in its system prompt\. The strategies are designed to vary one factor: how much, and what kind of, decision guidance is supplied externally\.

- •Baseline: a minimal expert framing with no reasoning requested\. The model is told only to “accumulate chips by making \+EV decisions” and to “respond with valid JSON only\.” This isolates the model’s default behavior absent any elicited reasoning\.
- •CoT: identical framing, but the model is instructed to “think step by step: assess your hand strength, estimate pot odds, consider opponents,” then to emit an explicitDECISION: <fold\|check\|call\|raise\>line before the JSON\. The explicit decision line is what makes the conclusion machine\-checkable \(Section 4\.1\)\.
- •Persona: a behavioral style is imposed rather than a reasoning procedure: “You are a tight\-aggressive \(TAG\) player…play only premium hands aggressively, fold marginal hands, and apply pressure from position\.” No reasoning is requested\.
- •GTO\-constrained: the decision rule itself is injected into the prompt: “estimate equity vs pot odds…fold if equity<<pot odds; call or raise if equity\>\>pot odds\.” This tests whether handing the model the rule improves adherence to it\.

For the reasoning→\\rightarrowconclusion analysis we add a diagnostic strategy,GTO\-verbose, which instructs the agent to emit four labeled lines before its JSON—EQUITY,POT\_ODDS,RULE, andDECISION—so that the agent’s own estimated inputs, its restatement of the rule, and its stated decision can each be checked against the others and against the Monte Carlo reference\. The verbatim system prompts and an example input are listed in Appendix[A](https://arxiv.org/html/2606.00476#A1)\.

### 3\.3Metrics

GTO adherenceis computed per decision by a Monte Carlo equity calculator \(500 simulations\) combined with an equity\-based threshold heuristic: fold if equity<<pot\_odds; call if\|\|equity−\-pot\_odds\|\|<<0\.05; raise otherwise\. We employ this simplified heuristic as a reproducible, fixed reference point rather than an exact game\-theoretically optimal policy; exact equilibrium computation in multi\-player Hold’em is intractable at this scale\. It is applied identically to all agents and strategies\.

Conclusion→\\rightarrowaction consistencyapplies to the CoT and GTO\-verbose strategies, which emit an explicitDECISIONtag\. A decision is flagged*inconsistent*when the tagged conclusion contradicts the JSON action\. To quantify the sensitivity of this measurement, we compare the explicit\-tag parser against a baseline regular\-expression parser that infers the conclusion from free\-text phrases such as “should raise” or “best to call”\.

Reasoning→\\rightarrowconclusion validityapplies to the GTO\-verbose strategy\. For each decision we check whether the statedDECISIONfollows from the agent’s own statedEQUITYandPOT\_ODDSunder the rule it restated\.

Figure[2](https://arxiv.org/html/2606.00476#S3.F2)summarizes how a single decision flows through the two paths and where each measurement is taken\. Note that the Monte Carlo equity serves a dual role: it determines the reference action \(for GTO adherence\) and provides the independent yardstick against which the agent’s*stated*equity is checked \(for reasoning→\\rightarrowconclusion validity\), which is what lets us separate input errors from rule misapplication\.

![Refer to caption](https://arxiv.org/html/2606.00476v1/x2.png)Figure 2:Per\-decision measurement pipeline\. Each game state is processed by two independent paths: the*agent path*\(top\), which produces a reasoning trace, an explicit stated conclusion, and an executed action; and the*reference path*\(bottom\), in which a 500\-rollout Monte Carlo estimate of equity, combined with pot odds, yields a reference action via a fixed threshold rule\. Three comparisons \(right\) define the metrics: conclusion→\\rightarrowaction consistency \(RQ1\), GTO adherence \(RQ2\), and reasoning→\\rightarrowconclusion validity \(the GTO\-verbose diagnostic of Section 4\.2\)\. The Monte Carlo equity feeds both the reference action and the input\-accuracy check\.
### 3\.4Experimental Setup

Two independent 200\-hand experiments are conducted withclaude\-haiku\-4\-5\-20251001andgemini\-2\.5\-flash\-lite, and a third 50\-hand experiment withdeepseek\-v4\-pro\(reduced hand count owing to the higher per\-decision latency of a reasoning model\), each at the model\-default temperature\. A separate 50\-hand GTO\-verbose diagnostic is run with Claude Haiku 4\.5\. Starting stack: 10,000 chips; big blind: 10 chips; Monte Carlo simulations: 500 per decision\. Each run places one seat per strategy at a single table; all four seats use the same underlying model\. Code and decision logs are available at[https://github\.com/louiswang524/texas\_poker](https://github.com/louiswang524/texas_poker)\.

## 4Results

![Refer to caption](https://arxiv.org/html/2606.00476v1/x3.png)
![Refer to caption](https://arxiv.org/html/2606.00476v1/x4.png)

Figure 3:\(a\)Conclusion→\\rightarrowaction inconsistency under two parsers\. A free\-text phrase parser reports 22–26%; an explicit decision tag reduces the same quantity to under 2%, revealing the former as a measurement artifact\.\(b\)Where erroneous GTO\-verbose decisions come from: most arise from misapplying a correctly stated rule to correctly estimated inputs, not from input miscalculation\.### 4\.1Conclusion→\\rightarrowAction Inconsistency Is a Measurement Artifact \(RQ1\)

Table[1](https://arxiv.org/html/2606.00476#S4.T1)reports the inconsistency between the stated conclusion and the executed action under two parsers\. With a free\-text phrase parser, the rate is 26\.4% \(Haiku\), 22\.0% \(Flash\-Lite\), and 33\.3% \(DeepSeek\-v4\-pro\), and a large fraction of decisions cannot be parsed at all\. With an explicitDECISIONtag, the parse rate reaches 98–100% and the inconsistency collapses to 0\.3%, 1\.8%, and 0\.0% respectively\. The effect is largest for DeepSeek: its native reasoning traces are especially verbose, mentioning many candidate actions in passing, which yields both the lowest free\-text parse rate \(44%\) and the highest extraction noise—and yet near\-perfect consistency \(0\.0%\) once the conclusion is read from the explicit tag\.

Table 1:Conclusion→\\rightarrowaction inconsistency by parser, with 95% Wilson confidence intervals\. The free\-text parser overstates inconsistency by roughly an order of magnitude\.The agents almost always execute the decision they state\. The apparent unfaithfulness reported by the phrase parser arises because free\-text reasoning mentions several candidate actions in passing \(“I could call here, but raising is stronger”\), and a last\-match heuristic frequently extracts the wrong one\. This answers RQ1: the conclusion→\\rightarrowaction gap is small, and the larger rates reported by free\-text phrase parsing reflect conclusion extraction\.

### 4\.2Errors Arise from Misapplied Reasoning, Not Bad Inputs \(RQ2\)

The GTO\-verbose diagnostic exposes the agent’s own estimated inputs and restated rule\. Across 791 decisions, overall adherence to the reference action is 60\.6%\. Two observations locate the failures\. First, the inputs are mostly correct: the agent’s stated equity falls within 0\.15 of the Monte Carlo value in 68% of decisions\. Second, among interpretable erroneous decisions, 65% are*rule\-misapplications*—the agent states the correct rule and plausible inputs, then announces a decision that does not follow from them—while only 19% stem from an equity estimate that is badly wrong, and 16% are borderline cases near the decision threshold\.

A representative example, reproduced verbatim from the logs:

> EQUITY: 0\.42 POT\_ODDS: 0\.25 RULE: fold if EQUITY < POT\_ODDS; raise if EQUITY \> POT\_ODDS \+ 0\.05; otherwise call\. “Since EQUITY \(0\.42\) \> POT\_ODDS \(0\.25\), the call is profitable\. However, since EQUITY \(0\.42\) is not significantly above POT\_ODDS \+ 0\.05 \(0\.30\), a raise is not justified\. The correct decision is to call\.” DECISION: call

The agent computes the threshold correctly \(0\.25\+0\.05=0\.300\.25\+0\.05=0\.30\) but then judges0\.420\.42to be “not significantly above”0\.300\.30, although0\.42\>0\.300\.42\>0\.30by a clear margin; the reference action israise\. The failure is neither an input error nor an execution error—it is an invalid inference from the agent’s own correctly stated premises\.

This pattern is unlikely to be a simple arithmetic mistake\. If the agents could not perform the numerical comparison, errors would be roughly symmetric; instead they are strongly directional\. In 97% of rule\-misapplications the stated rule impliesraiseyet the agent chooses a passive action \(call, check, or fold\), and the violated margin is large rather than borderline \(mean stated equity exceeds the raise threshold by 0\.22, and by more than 0\.10 in 76% of cases\)\. The model thus computes the inputs and the threshold correctly but lets a qualitative judgment—hedging language such as “not significantly above” or “a weak hand”—override the quantitative rule it just stated, in a consistently risk\-averse direction\.

### 4\.3Injecting the Rule Does Not Improve Adherence

Table[2](https://arxiv.org/html/2606.00476#S4.T2)reports GTO adherence by prompt strategy across all three models\. Aχ2\\chi^\{2\}test confirms that strategy significantly affects adherence in the larger runs \(Haiku:χ2=8\.5\\chi^\{2\}=8\.5,p=0\.04p=0\.04; Flash\-Lite:χ2=28\.9\\chi^\{2\}=28\.9,p<0\.001p<0\.001; the smaller DeepSeek run shows the same ordering but does not reach significance on its own\)\. Across all three models, the GTO\-constrained strategy never attains the highest adherence and consistently sits at or near the bottom \(47\.9–54\.6% across all runs\)\. Externalizing the decision rule into the prompt therefore does not improve compliance with that rule, consistent with the rule\-misapplication mechanism above: the limiting factor is applying a rule, not knowing it\. This echoes a recent real\-world marketplace experiment in which Claude agents negotiated on behalf of people\(Anthropic,[2025](https://arxiv.org/html/2606.00476#bib.bib9)\): instructing agents to negotiate more aggressively produced little change in outcomes, while weaker agents underperformed in a way that was imperceptible to the people they represented—a behavioral gap that outcome\-level observation did not reveal\.

Table 2:GTO adherence rate by strategy across three model families \(explicit\-tag runs\)\. Cells show adherence % with decision countnn\. DeepSeek results are from a shorter 50\-hand run \(vs\. 200 for the others\) due to its higher per\-decision latency, hence the smallernnand wider intervals\. In every model the GTO\-constrained strategy ranks at or near the bottom; it never attains the top\.

## 5Discussion

#### Where the gap actually is\.

Decomposing the decision pipeline shows that LLM poker agents are reliable at the conclusion→\\rightarrowaction step and unreliable at the reasoning→\\rightarrowconclusion step\. For social simulation this is an encouraging and a cautionary message at once: an agent’s final stated decision is a faithful predictor of what it will do, but the reasoning leading to that decision is not a reliable derivation, even when every intermediate quantity is stated and approximately correct\. Evaluations that only check whether an action matches a stated conclusion will see near\-perfect fidelity and miss the real failure mode\.

#### Implications for persona modeling and emergent behavior\.

The same decomposition predicts a specific failure for persona\-driven agents\. A persona—“tight\-aggressive,” “risk\-averse,” “cooperative”—is a stated disposition meant to shape behavior through reasoning\. Our results show that even a fully stated, correctly computed rationale need not yield a conclusion consistent with it: in poker, a stated equity rule was overridden by qualitative hedging in a systematically risk\-averse direction\. In a richer social simulation, an analogous override would surface as*persona drift*—an agent that articulates its persona faithfully while its actions are pulled toward a generic, conservative prior independent of the persona it states\. Because such drift is invisible to outcome\-level and conclusion\-level checks, emergent group phenomena built on these agents—norm formation, coalition dynamics, information cascades—may reflect this shared prior rather than the configured personas, yielding model\-bias artifacts in place of substantive social behavior\. Step\-level process\-fidelity probes of the kind used here offer one way to detect such drift before it propagates through a simulation\.

#### Measurement sensitivity\.

The order\-of\-magnitude difference between the two parsers \(Table[1](https://arxiv.org/html/2606.00476#S4.T1)\) is itself a finding\. Reported CoT unfaithfulness rates are sensitive to how the conclusion is elicited and extracted; free\-text phrase extraction systematically overstates inconsistency\. This is a concrete instance of a general concern that results in this area depend strongly on the elicitation protocol and measurement instrument\(Puelma Touzelet al\.,[2026](https://arxiv.org/html/2606.00476#bib.bib8)\): here, the same behavior measured two ways differs by an order of magnitude\. We recommend that studies of stated\-vs\.\-actual fidelity elicit an explicit, machine\-checkable conclusion rather than inferring it from prose\.

#### Limitations and future work\.

The three\-model scope, while spanning two prompted models and one natively trained reasoning model \(DeepSeek\-v4\-pro\), is still limited; the DeepSeek run also uses fewer hands, widening its confidence intervals\. That the decomposition holds for a native reasoner—whose chain\-of\-thought is produced without explicit prompting—is reassuring but not conclusive, and broader model coverage remains future work\. The rule\-misapplication analysis relies on the GTO\-verbose elicitation, which may itself alter behavior relative to silent reasoning, and was run only on Claude Haiku 4\.5; replicating it across models is a natural next step\. Poker is deliberately not a full social simulation: it lacks natural\-language interaction, persuasion, and norm formation, and we do not claim it substitutes for them\. The contribution is instead methodological\. Poker supplies what richer social simulations lack—a verifiable per\-decision reference—and thereby exposes a step\-level fidelity probe \(does the stated conclusion follow from the stated reasoning?\) that is otherwise hard to construct\. That probe is portable: any social simulation for which an approximate reference behavior can be defined can apply the same decomposition\. We therefore treat these experiments as a calibration step toward better process\-fidelity evaluation in social simulation, not as a comprehensive evaluation of social behavior\.

#### Demo\.

We provide a browser\-based demonstration \(FastAPI \+ WebSocket\) with three modes\. The*AI Spectator*mode is the interactive counterpart of our experiment: attendees watch Claude, Gemini, and DeepSeek agents play one another at a single table, with each agent’s per\-decision reasoning trace, explicit stated conclusion, Monte Carlo equity, and executed action surfaced in real time\. This makes both pipeline steps directly observable as they happen—whether the executed action matches the stated conclusion \(conclusion→\\rightarrowaction\), and whether that conclusion follows from the agent’s own reasoning \(reasoning→\\rightarrowconclusion\)\. Two human\-facing modes reuse the same harness against LLM opponents:*Training*mode shows the player their equity, pot odds, and a suggested action before each decision;*Practice*mode delivers a post\-hand breakdown of every decision with the expected\-value cost of any deviation from the reference action\. To run it, set an LLM provider API key and launch the bundled web server from the repository root\. The system and a live demo are available at[https://github\.com/louiswang524/texas\_poker](https://github.com/louiswang524/texas_poker)\.

## 6Conclusion

We decomposed the faithfulness gap in LLM poker agents into two steps and found that they behave oppositely\. Conclusion→\\rightarrowaction inconsistency is near zero \(0\.3–1\.8%\) once the conclusion is elicited explicitly; the 22–26% rates reported by free\-text phrase\-extraction parsing are a measurement artifact\. The real gap is upstream, in reasoning→\\rightarrowconclusion: agents estimate inputs correctly and restate the rule, yet misapply it in 65% of erroneous decisions, and injecting the rule into the prompt does not help\. Process\-fidelity evaluation for LLM\-based social simulation should therefore target the derivation of conclusions from reasoning, and should elicit machine\-checkable conclusions to avoid conflating measurement noise with model behavior\.

## References

- Anthropic \(2025\)Project deal\.Note:[https://www\.anthropic\.com/features/project\-deal](https://www.anthropic.com/features/project-deal)Cited by:[§4\.3](https://arxiv.org/html/2606.00476#S4.SS3.p1.5)\.
- A\. Costarelli, M\. Allen, R\. Hauksson, G\. Sodunke, S\. Hariharan, C\. Cheng, W\. Li, J\. Clymer, and A\. Yadav \(2024\)GameBench: evaluating strategic reasoning abilities of LLM agents\.arXiv preprint arXiv:2406\.06613\.Cited by:[§2](https://arxiv.org/html/2606.00476#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Lin, E\. Dai, H\. Liu, X\. Tang, Y\. Yan, Z\. Dai, J\. Zeng, Z\. Zhang, F\. Wang, H\. Gao, C\. Luo, X\. Zhang, Q\. He, and S\. Wang \(2026\)How far are LLMs from professional poker players? Revisiting game\-theoretic reasoning with agentic tool use\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.00476#S1.p3.2),[§2](https://arxiv.org/html/2606.00476#S2.SS0.SSS0.Px3.p1.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the ACM Symposium on User Interface Software and Technology \(UIST\),Cited by:[§1](https://arxiv.org/html/2606.00476#S1.p1.1),[§2](https://arxiv.org/html/2606.00476#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Puelma Touzel, S\. Sarangi, A\. Bück\-Kaeffer, Z\. Yang, J\. Godbout, and R\. Rabbany \(2026\)Position: time to close the validation gap in LLM social simulations\.InProceedings of the 43rd International Conference on Machine Learning \(ICML\),PMLR 306\.Cited by:[§1](https://arxiv.org/html/2606.00476#S1.p1.1),[§2](https://arxiv.org/html/2606.00476#S2.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.00476#S5.SS0.SSS0.Px3.p1.1)\.
- M\. Turpin, J\. Michael, E\. Perez, and S\. R\. Bowman \(2023\)Language models don’t always say what they think: unfaithful explanations in chain\-of\-thought prompting\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§1](https://arxiv.org/html/2606.00476#S1.p3.2),[§2](https://arxiv.org/html/2606.00476#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, E\. Chi, Q\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.35\.Cited by:[§2](https://arxiv.org/html/2606.00476#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Xie, Z\. Shi, H\. Shen, Y\. Ma, and X\. Jing \(2026\)M3\-BENCH: process\-aware evaluation of LLM agents’ social behaviors in mixed\-motive games\.arXiv preprint arXiv:2601\.08462\.Cited by:[§2](https://arxiv.org/html/2606.00476#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Zhou, H\. Zhu, L\. Mathur, R\. Zhang, H\. Yu, Z\. Qi, L\. Morency, Y\. Bisk, D\. Fried, G\. Neubig, and M\. Sap \(2024\)SOTOPIA: interactive evaluation for social intelligence in language agents\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.00476#S1.p1.1),[§1](https://arxiv.org/html/2606.00476#S1.p2.1),[§2](https://arxiv.org/html/2606.00476#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix APrompts

Table[3](https://arxiv.org/html/2606.00476#A1.T3)lists the verbatim system prompts for all five strategies \(the shared expert framing is elided with “…”\) together with an example of the per\-decision user prompt\. Only the system prompt varies across strategies; every strategy receives the same game\-state user prompt\.

BaselineYou are an expert Texas Hold’em poker player\. Your goal is to accumulate chips by making \+EV decisions\. You must respond with valid JSON only \-\-\- no reasoning, no explanation, no other text\. Example: \{"action": "call", "amount": 0\}CoT…Think step by step: assess your hand strength, estimate pot odds, consider opponents\. After your reasoning, state your conclusion on its own line in EXACTLY this format: DECISION: <fold\|check\|call\|raise\>\. Then on the very last line respond with JSON only\.PersonaYou are a tight\-aggressive \(TAG\) Texas Hold’em player…You play only premium hands aggressively, fold marginal hands, and apply pressure from position\. You must respond with valid JSON only…GTO\-constrained…Mentally estimate equity vs pot odds \(pot\_odds = call\_amount / \(pot \+ call\_amount\)\)\. Fold if equity < pot\_odds; call or raise if equity \> pot\_odds\. You must respond with valid JSON only…GTO\-verbose\(diagnostic\)…Show your work in EXACTLY these labeled lines, then the JSON: EQUITY: <0\.0\-\-1\.0\>; POT\_ODDS: <0\.0\-\-1\.0\>; RULE: fold if EQUITY < POT\_ODDS; raise if EQUITY \> POT\_ODDS \+ 0\.05; otherwise call; DECISION: <fold\|check\|call\|raise\>Example user prompt \(shared by all strategies\)Street: PREFLOP\. Your hole cards: A♠\\spadesuitK♠\\spadesuit\. Community cards: none\. Pot: 15\. Current bet to call: 10\. Your stack: 9990\. Players: AI\-1 stack=9990 active; AI\-2 stack=10000 folded; AI\-3 stack=9995 active\. Available actions: call \(10 chips\), fold, raise \(min 20 chips\)\.Table 3:System prompts for the five strategies and an example per\-decision user prompt\.

Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

Similar Articles

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

Faithful uncertainty in LLM agents: calibration vs utility tradeoff in practice[D]

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

Measuring AI Faithfulness-For Better or For Worse

Submit Feedback

Similar Articles

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

Faithful uncertainty in LLM agents: calibration vs utility tradeoff in practice[D]

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

Measuring AI Faithfulness-For Better or For Worse