# Effective Explanations Support Planning Under Uncertainty
Source: [https://arxiv.org/html/2605.08406](https://arxiv.org/html/2605.08406)
Hanqi Zhou^{1,2,3,4,*}, Britt Besch^{5}, Charley M. Wu^{2,3,4} & Tobias Gerstenberg^{6}
*hanqi.zhou@uni-tuebingen.de
^1 University of Tübingen, ^2 Technical University Darmstadt, ^3 Hessian.AI, ^4 Max Planck Institute for Biological Cybernetics, ^5 University of Cambridge, ^6 Stanford University
## Abstract
Explaining how to get from A to B can be challenging. It requires mentally simulating what the listener will do based on what they are told. To capture this process, we propose a computational model that converts utterances into action plans: a large language model translates an explanation into program-like guidance (a policy prior and value map), and a planning agent executes it under partial observability. We score explanations by the efficiency and reliability of the resulting paths, penalizing replanning. Across four preregistered experiments, we collect a corpus of 1,200 explanations over 24 maps, elicit helpfulness judgments, measure baseline navigation, and test behavior with explanations of differing quality. Higher-scored explanations are judged more helpful and improve navigation: participants with explanations outperform those without, and high-scoring explanations help more than low-scoring ones. Together, these results show procedural explanation as utility-guided communication shaped by how language can be grounded into action under uncertainty.
Keywords: explanation; navigation; planning; mental simulation; program induction
## Introduction
People routinely explain *how* to do things: how to find a building on a campus, how to assemble furniture, or how to troubleshoot a device. Such procedural explanations are rarely exhaustive descriptions of all possible actions. Instead, an effective explanation selects what matters and structures it in a way that enables the listener to act successfully.
Viewed this way, procedural explanations are a special case of explanations aimed at supporting action rather than mere belief change. Existing work often emphasizes the role of such explanations in learning, by helping listeners infer the underlying causal structure that generates the observed data, allowing them to generalize their knowledge to new situations [18, 19, 17, 4]. In many everyday settings, however, the central function is immediate action guidance under constraints: listeners have limited attention and memory, and speakers are uncertain about the context in which the explanation will be used. On this view, explanation quality is measured not just by semantic adequacy or completeness, but also by the pragmatic value of helping a listener make the right decisions.
This perspective aligns with broader theories of communication as *utility-guided inference* [see 23, 14]. In Gricean pragmatics, speakers cooperate by being informative while respecting conversational constraints, e.g., giving just enough information [13]. Probabilistic approaches model this utterance choice as boundedly rational: speakers trade off expected communicative success against costs like utterance length or complexity [8]. The Rational Speech Act (RSA) framework instantiates this idea by formalizing language use as probabilistic inference over speakers' and listeners' mental states [10, 12, 4]. Complementing this approach, computational models of pedagogy show that teaching goals change what speakers choose to convey, and what listeners can infer, yielding listener behavior that is distinct from acting or observing [22, 2, 5, 25]. Thus, a common theme is that communicative acts are shaped by their downstream consequences for an intended recipient.
Here, we use navigation as a case study. When one person explains how to get from A to B, the explainer knows the environment while the listener must act from the utterance plus local observations [11]. Work on wayfinding shows that speakers adapt to this asymmetry by emphasizing landmarks, decision points, and hierarchical structure rather than every step [21, 9, 3], suggesting a normative principle in which effective explanations allocate information to minimize expected downstream mistakes, not to maximize descriptive completeness.
Figure 1: Modeling language-guided navigation and experimental pipeline. (a) Explanation collection (Exp. 1): an explainer with full knowledge of the environment generates natural-language explanations for an explainee acting under partial observability. (b) Explanation modeling and selection: free-form text explanations are translated by LLMs into symbolic programs and evaluated by a simulated agent that plans and acts based on the symbolic program. The performance of this agent is used to rate explanation quality. (c) Explanation evaluation: model-ranked explanations are evaluated via behavioral experiments measuring perceived helpfulness (Exp. 2), baseline performance without explanation (Exp. 3), and performance with bad/medium/good explanations (Exp. 4).

At the same time, navigation is an open-ended instruction problem: environments can be large, and explainers can describe routes in many valid ways. This open-endedness has fueled interest in *machine-generated explanations* from large language models (LLMs), which can produce fluent instructions on the fly [15, 20]. Yet fluency is not enough: generated explanations can be causally wrong or emphasize details that do not help a listener act, motivating behavioral evaluation of whether they actually improve performance. This opens an opportunity for cognitive science and AI to study how language is grounded into executable actions under uncertainty [6, 1], raising a central question: *what makes an explanation effective for guiding action*?
We address this question in a controlled navigation domain that isolates key ingredients of procedural explanation under partial observability. An *explainer* has global knowledge of the environment and goal, while an *explainee* navigates based on the explanation and their local observations. Across four experiments (Fig. 1), we collect a corpus of human-written navigation explanations (Exp. 1), measure perceived helpfulness (Exp. 2), establish baseline navigation performance without explanations (Exp. 3), and test how explanations affect navigation when available (Exp. 4). We then evaluate a computational account in which explainers trade off expected gains in explainee success against communicative costs. Bringing together what people say (explanations), what they think is helpful (judgments), and what they do (navigation outcomes), we characterize procedural explanation as a resource-rational strategy for acting and communicating under uncertainty and limited shared knowledge.
## Experiments
All four experiments were preregistered (full materials: [github.com/cicl-stanford/explaining_how_cogsci26](https://github.com/cicl-stanford/explaining_how_cogsci26)) and conducted on Prolific. We use a grid-based “dungeon” navigation environment with 24 distinct maps, organized into 12 near-matched pairs that differ only by small, localized edits (see Fig. 2a for examples). Each map has a start location and a hidden goal (treasure). An explainer sees the full map and writes a natural-language message to guide a partner to the treasure. The explainee either rates the message's helpfulness or navigates under partial observability, viewing only a local region around their current position.
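To make the task structure concrete, the sketch below shows one way the partial-observability interface could be represented; the grid encoding, viewing radius, and example map are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the task structure: a grid map with walls, a start, a hidden
# goal, and a local field of view that is all the explainee ever observes.
# The encoding and viewing radius here are assumptions for illustration.
import numpy as np

FLOOR, WALL = 0, 1

def local_observation(grid: np.ndarray, pos: tuple, radius: int = 2) -> np.ndarray:
    """Return the window of the map visible around the current position (clipped at edges)."""
    r, c = pos
    h, w = grid.shape
    return grid[max(0, r - radius):min(h, r + radius + 1),
                max(0, c - radius):min(w, c + radius + 1)]

# Example: a 6x6 map; the explainee at the start sees only a clipped local window.
grid = np.zeros((6, 6), dtype=int)
grid[2, 1:5] = WALL                  # a wall forming a corridor
start, goal = (5, 0), (0, 5)         # the goal (treasure) is hidden from the explainee
print(local_observation(grid, start))
```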
Figure 2: Paired map examples and behavioral effects of explanation quality. (a) Example map pairs: matched overall layouts with small local changes (obstacles/structure). (b) Path length (top) and helpfulness ratings (bottom) by condition (None, Bad, Medium, Good) for each map pair (columns). Bars show means ± SE; dots show individual participants. Higher-quality explanations consistently increase perceived helpfulness, but path-length gains vary across pairs.

**Exp. 1: Explanation collection.** To collect a dataset of explanations, we recruited N = 50 participants to act as explainers (Fig. 1a ➀). On each trial, explainers saw the full map and wrote a free-text message to help a partner find the treasure. They were instructed: “Please send a message to your partner that will help them find the treasure. Remember that your partner can only see the highlighted area – they cannot see the whole map.” Each participant wrote explanations for 24 maps presented in randomized order. We recorded explanation text and response times (average completion time 25.64 ± 11.70 minutes). We performed minimal preprocessing on the free-form explanations (whitespace normalization; no content edits).
**Exp. 2: Helpfulness judgments.** To acquire participant ratings of the helpfulness of the explanations generated in Exp. 1, we recruited a separate group of N = 50 participants. For each map, we selected three explanations (“good”, “medium”, “bad”) following a model-based ranking procedure (Fig. 1c ➁). Participants viewed each map and rated how helpful each message was under the instruction “Please evaluate the following messages by rating how helpful they are for finding the treasure”. Both maps and messages were presented in randomized order (avg. 22.62 ± 10.66 mins). We preregistered the hypothesis that participants' helpfulness ratings would reflect the model-based ranking order Good > Medium > Bad.
**Exp. 3: Baseline navigation without explanations.** To provide a baseline measure of map difficulty and between-map variability in path length, we recruited N = 50 participants to navigate each map under partial observability with the instruction “Find the treasure in as few steps as possible”. The interface revealed a local field of view around the current position (Fig. 1c ➂). We recorded participants' trajectories and completion time (avg. 10.89 ± 4.80 mins).
**Exp. 4: Navigation with explanations.** To test how explanation quality shapes navigation efficiency, we recruited N = 150 participants to navigate each map under partial observability (with the same instruction as Exp. 3) while also receiving one explanation per map (Fig. 1c ➃; avg. 13.46 ± 5.92 mins). Explanation quality (Good/Medium/Bad) was manipulated within subjects using counterbalanced lists so that each participant saw equal numbers of maps at each quality level. Our preregistered hypothesis was that mean path length would follow the model ranking (Good < Medium < Bad) and that explanations would reduce path length overall (With explanation < No explanation).
## Computational model
An explanation is useful to the extent that it can be translated into guidance that improves downstream navigation. We model navigation as a POMDP parameterized by the fully observed world $w$ (map, start, goal): the agent takes actions $a_t$ with transitions $P(s_{t+1} \mid s_t, a_t, w)$ over latent states $s_t$, observing only local observations $o_t$ [16].
**LLM translation as a stochastic compiler.** We use a large language model (LLM) to compile an explanation $e$ into program-like guidance for planning. We compile explanations into a policy prior and value map, $\tau = \big(\hat{\pi}_\tau(a \mid s), \hat{V}_\tau(s)\big)$, where $\hat{\pi}_\tau$ is a policy prior over actions and $\hat{V}_\tau$ is a value map over planner states $s$. Planner states $s = \phi(o, w)$ are computed from observations $o$ in the current world $w$. Because LLM decoding is stochastic, we treat compilation as a conditional distribution $p(\tau \mid e, w)$. In practice, we approximate this distribution by repeatedly running the same prompting procedure and obtaining $K$ compilations $\tau^{(k)} \sim \mathcal{T}(e, w)$, where $\mathcal{T}(e, w)$ denotes the output of the LLM under a fixed prompt constructed from $(e, w)$ and fixed decoding settings.
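As a reading of this setup (not the authors' released interface; the data structures, prompt format, and `llm_call` below are assumptions), the compiled guidance and the repeated sampling of compilations can be sketched as follows.

```python
# Sketch of the program-like guidance produced by the LLM compiler: a policy
# prior over actions and a value map over planner states, sampled K times
# because decoding is stochastic. Names and prompt format are assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]   # planner state, e.g. grid coordinates after s = phi(o, w)
Action = str              # e.g. "up", "down", "left", "right"

@dataclass
class Guidance:
    policy_prior: Dict[State, Dict[Action, float]]   # pi_hat_tau(a | s)
    value_map: Dict[State, float]                     # V_hat_tau(s)

def compile_explanation(explanation: str, world: str,
                        llm_call: Callable[[str], Guidance], k: int = 5) -> List[Guidance]:
    """Approximate p(tau | e, w) by K repeated calls under a fixed prompt and decoding settings."""
    prompt = f"Map:\n{world}\nExplanation: {explanation}\nReturn a policy prior and a value map."
    return [llm_call(prompt) for _ in range(k)]
```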
Let $\textsc{Plan}(\tau, s)$ denote the agent's action-selection rule given guidance $\tau$ at state $s$ (e.g., using $\hat{\pi}_\tau$ or $\hat{V}_\tau$ to bias search). At time $t$,
$$a_t \sim \textsc{Plan}(\tau_t, s_t), \qquad s_t = \phi(o_t, w). \tag{1}$$

If the current compilation fails to yield usable guidance (e.g., it is invalid or repeatedly dead-ends), the agent re-queries the translator for a new sample, $\tau_t \leftarrow \mathcal{T}(e, w)$ whenever $\textsc{Fail}(\tau_t, o_t) = 1$.
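A minimal sketch of this execution loop is shown below; the world interface, failure test, and step budget are assumptions for illustration rather than the authors' code.

```python
# Sketch of Eq. (1): plan with the current compilation and re-query the
# translator whenever the guidance fails. Returns the statistics used later by
# the utility model (success, path length, number of replans).
def run_episode(explanation, world, translate, plan, fail, phi, max_steps=100):
    tau = translate(explanation, world)            # tau ~ T(e, w)
    state = world.start
    n_queries, steps = 1, 0
    for _ in range(max_steps):
        obs = world.observe(state)                 # o_t: local field of view
        s = phi(obs, world)                        # s_t = phi(o_t, w)
        if fail(tau, obs):                         # invalid or dead-ended guidance
            tau = translate(explanation, world)    # draw a fresh compilation
            n_queries += 1
        action = plan(tau, s)                      # a_t ~ Plan(tau_t, s_t)
        state = world.step(state, action)
        steps += 1
        if state == world.goal:
            return {"success": True, "length": steps, "replans": n_queries - 1}
    return {"success": False, "length": max_steps, "replans": n_queries - 1}
```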
Figure 3: Explanations improve navigation efficiency and subjective helpfulness. (a) Example trajectories by condition: no explanation yields inefficient exploration; low-quality explanations are vague procedural directives; high-quality explanations emphasize relevant landmarks and structure, producing more direct paths. (b) Path length by condition (No/Bad/Medium/Good): higher quality yields shorter paths and higher helpfulness ratings. Bars show means; black lines show bootstrapped 95% confidence intervals.

**Utility trades off replanning, efficiency, and success.** We define the utility of an explanation as a function of replanning cost $\textsc{Replan}$ (due to ambiguous explanations), efficiency $\textsc{Len}_{\min}$ (the shortest successful path length), and downstream success $\textsc{Succ}$ (whether the goal is reached), measured over $N$ independent attempts:
$$\textsc{Replan}(e, w) := \frac{1}{N}\sum_{i=1}^{N} R_i(e, w), \tag{2}$$
$$\textsc{Len}_{\min}(e, w) := \min_{i:\, S_i(e, w) = 1} L_i(e, w), \tag{3}$$
$$\textsc{Succ}(e, w) := \frac{1}{N}\sum_{i=1}^{N} S_i(e, w). \tag{4}$$

For attempt $i \in \{1, \dots, N\}$, the replanning cost of underspecified instructions is the number of translator queries $R_i(e, w)$, efficiency is measured by the realized path length $L_i(e, w)$, and $S_i(e, w) \in \{0, 1\}$ indicates whether the goal was reached within the episode budget (i.e., the maximum number of steps). If no attempt succeeds, we set $\textsc{Len}_{\min}(e, w)$ to the episode budget to penalize failures. These are combined to define explanation utility as
$$U(e, w) := -\alpha\,\textsc{Replan}(e, w) - \beta\,\textsc{Len}_{\min}(e, w) + \gamma\,\textsc{Succ}(e, w), \tag{5}$$

with $\alpha, \beta, \gamma \geq 0$. Intuitively, effective procedural explanations minimize replanning, support efficient action, and robustly enable goal completion.
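Assuming episode statistics like those returned by the loop sketched above, the utility in Eqs. (2)–(5) reduces to a few lines; the weights α, β, γ are free parameters, and the default values below are placeholders.

```python
# Sketch of Eqs. (2)-(5): aggregate N simulated attempts into replanning cost,
# shortest successful path length, and success rate, then combine them.
def explanation_utility(attempts, episode_budget, alpha=1.0, beta=0.1, gamma=10.0):
    n = len(attempts)
    replan = sum(a["replans"] for a in attempts) / n                    # Eq. (2)
    successful = [a["length"] for a in attempts if a["success"]]
    len_min = min(successful) if successful else episode_budget         # Eq. (3), capped on failure
    succ = sum(a["success"] for a in attempts) / n                      # Eq. (4)
    return -alpha * replan - beta * len_min + gamma * succ              # Eq. (5)
```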
**Speaker model (RSA-style explanation choice).** Finally, we treat the explainer as choosing an explanation that maximizes expected utility for a recipient who will use LLM translation and planning. The pragmatic speaker distribution is
$$S(e \mid w) \propto \exp\left(\lambda\, U(e, w)\right), \tag{6}$$

where $\lambda$ is an inverse temperature parameter controlling how strongly the speaker favors high-utility explanations.
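Given a set of scored candidate explanations (the candidate set itself is hypothetical here), Eq. (6) is a softmax over utilities:

```python
# Sketch of Eq. (6): a softmax speaker with inverse temperature lambda.
import numpy as np

def speaker_distribution(candidates, world, utility_fn, lam=1.0):
    utilities = np.array([utility_fn(e, world) for e in candidates])
    logits = lam * utilities
    probs = np.exp(logits - logits.max())      # subtract max for numerical stability
    return probs / probs.sum()                 # S(e | w) over the candidate explanations
```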
## Results
We first describe the explanation corpus and map structure. We then test whether model-ranked explanation quality predicts helpfulness judgments and navigation performance, and compare the full model to simpler baselines.
### Descriptions of collected explanations
**Corpus and maps description.** The final corpus from Exp. 1 contained 1,200 explanations across 24 maps (avg. 21.16 ± 13.68 words). Qualitatively, explanations ranged from value-focused goal descriptions (“top-left corner”) to policy-focused descriptions of either step-by-step action sequences (“go up twice”) or high-level contingencies (“if you see X, do Y”). In a keyword-based coding (e.g., “center” for value, “steps” for low-level policy), both types were often mentioned non-exclusively, such that mixed explanations were most common (value: 75.4%; low-level policy: 74.7%; high-level policy: 96.8%; both: 58.3%).
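The keyword lists below are illustrative guesses at what such a coding scheme might look like, not the preregistered coding used in the paper.

```python
# Toy sketch of keyword-based coding of explanation content (value vs. policy).
# The specific keyword sets are assumptions for illustration only.
VALUE_WORDS = {"corner", "center", "treasure", "top", "bottom", "left", "right"}
POLICY_LOW_WORDS = {"step", "steps", "go", "move", "up", "down"}
POLICY_HIGH_WORDS = {"if", "when", "until", "then"}

def code_explanation(text: str) -> dict:
    words = set(text.lower().replace(",", " ").split())
    return {
        "value": bool(words & VALUE_WORDS),
        "low_level_policy": bool(words & POLICY_LOW_WORDS),
        "high_level_policy": bool(words & POLICY_HIGH_WORDS),
    }

print(code_explanation("Go up twice, then head to the top-left corner"))
# {'value': True, 'low_level_policy': True, 'high_level_policy': True}
```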
To relate explanation style to map structure, we computed graph properties of the reachable state space from the start: (i) shortest-path distance to the goal; (ii) brittleness, measured by the fraction of dead ends (degree = 1) and corridors (degree = 2); and (iii) openness, measured by the fraction of open cells (degree ≥ 3). Across maps (n = 24), explanation length was unrelated to shortest-path distance (Spearman ρ = .08, p = .69), but increased strongly with brittleness (ρ = .73, p < .001): maps with more corridors and dead ends elicited longer explanations. Intuitively, dead ends and narrow corridors make deviations costly, encouraging more explicit, step-by-step guidance. Conversely, maps with more decision points had shorter explanations (ρ = −.74, p < .001), consistent with the idea that when many alternative routes are available, goal-level guidance suffices because mistakes are easy to recover from.
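These degree-based metrics can be computed directly from a map's reachable floor cells; the sketch below assumes the same 0-for-floor grid encoding used earlier and is not the authors' analysis code.

```python
# Sketch of the map-structure metrics: BFS over reachable floor cells, then
# summarize node degrees (brittleness = dead ends + corridors, openness = cells
# with three or more open neighbors).
from collections import deque

def degree_profile(grid, start):
    h, w = len(grid), len(grid[0])
    def neighbors(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and grid[nr][nc] == 0:
                yield nr, nc
    seen, queue = {start}, deque([start])
    while queue:                                  # BFS over the reachable state space
        cell = queue.popleft()
        for nxt in neighbors(*cell):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    degrees = [sum(1 for _ in neighbors(*cell)) for cell in seen]
    n = len(degrees)
    return {
        "brittleness": sum(d in (1, 2) for d in degrees) / n,   # dead ends and corridors
        "openness": sum(d >= 3 for d in degrees) / n,           # decision points
    }
```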
**Stimulus selection for Experiments 2–4.** To select stimuli with controlled quality differences, we used the open-source DeepSeek-R1 (70B) as the LLM $\mathcal{T}(e, w)$. We ranked explanations and defined *Good*, *Medium*, and *Bad* bins using quantiles of $U(e)$ (higher is better). This gave us a model-based ordering that we tested against human judgments of explanation quality (Exp. 2) and navigation performance (Exps. 3–4). This resulted in 72 explanations (3 per map), with length 20.98 ± 14.78 words.
### Behavioral analysis
To evaluate model-ranked explanations, we analyzed both subjective judgments of helpfulness (Exp. 2) and objective navigation performance (Exps. 3–4). Across analyses, we tested whether explanation quality (Eq. 5) corresponded to human judgments and produced measurable improvements in navigation performance under partial observability.
**Explanation utility aligns with human judgments.** We first tested whether model-categorized explanation quality predicted perceived helpfulness. Fig. 3b shows averaged ratings by explanation quality. Consistent with preregistered predictions, explanations labeled as *good* were rated as more helpful than *medium* explanations, which in turn were rated higher than *bad* explanations, yielding a clear monotonic ordering (one-sided Welch's t-tests: Good > Medium: t(49) = 13.25, p < .001, d = 1.86; Medium > Bad: t(49) = 18.48, p < .001, d = 2.59). A Bayesian linear mixed-effects model with ordered quality (Bad = 0, Medium = 1, Good = 2) and random intercepts for participants and maps confirmed this monotonic increase (β = 17.87, 95% CI [17.02, 18.74]): explanations predicted to help a partially informed listener are consistently judged as more helpful.
Figure 4: Model comparison and component ablations. (a) Points show participants' helpfulness ratings by full-model utility score; the line shows the predicted–subjective helpfulness regression. (b) Leave-one-component-out ablations. Bars show the likelihood-ratio χ² statistic versus the full model; larger values indicate greater loss of fit. The dashed line marks the p = .05 threshold; bars above it indicate a significant loss of fit after predictor removal.

**Higher-quality explanations produce more efficient navigation.** We tested whether navigation efficiency varies with explanation quality. Relative to no-explanation trials, providing explanations substantially reduced path length overall (β = −7.73, 95% CI [−9.05, −6.12]), showing that even imperfect guidance helps under partial observability. Within explanation trials, we fit a Bayesian mixed-effects model with explanation quality entered as an ordinal score (Bad = 0, Medium = 1, Good = 2) and random intercepts for participants and maps. Higher-quality explanations predicted shorter paths (β = −3.65, 95% CI [−4.25, −3.00]).
### Computational model evaluation
We next evaluated whether the full utility model predicts human helpfulness judgments better than simpler alternatives. We scored explanations by simulating how a partially informed listener would translate the explanation into executable guidance and then act.
We compared the full utility model to two variants. **Length-only:** as a simple processing heuristic, we scored explanations by negative word count, $U_{\textsc{Len}}(e) \propto -\textsc{Len}(e)$. **Direct-action (non-program):** to test whether an explicit program-like intermediate representation is necessary, we prompted the LLM to output the agent's next action directly from the current observation and explanation, without an explicit policy and value representation, and without a replanning loop. We evaluated each explanation over $N$ attempts per map and scored it using success and path length (capped on failure), but *without* replanning cost: $U_{\textsc{Direct}}(e, w) := \delta\,\textsc{Succ}(e, w) - \alpha\,\mathbb{E}[L(e, w)]$, with fixed $\alpha, \delta \geq 0$ across models.
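For concreteness, the two baseline scores can be sketched as follows; the rollout statistics and the fixed weights are assumptions for illustration.

```python
# Sketch of the two baselines: a length-only heuristic and a direct-action score
# that uses success and expected (capped) path length but no replanning term.
def length_only_score(explanation: str) -> float:
    return -len(explanation.split())              # U_Len(e) proportional to -Len(e)

def direct_action_score(attempts, episode_budget, alpha=0.1, delta=10.0):
    # attempts: rollouts where the LLM outputs the next action directly from (o_t, e)
    succ = sum(a["success"] for a in attempts) / len(attempts)
    lengths = [a["length"] if a["success"] else episode_budget for a in attempts]
    return delta * succ - alpha * sum(lengths) / len(lengths)   # U_Direct(e, w)
```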
**Analyzing what makes a good explanation.** For comparability, we normalized each model's score to [0, 1] within each map and analyzed only explanations that were rated by participants. We then fitted a mixed-effects regression predicting helpfulness from the full model's utility score, a length-only score, and a direct-action score entered simultaneously, with random intercepts for participants and maps. The utility score uniquely predicted higher helpfulness (β = 38.47, 95% CI [36.24, 40.17]). In contrast, the length-only and direct-action predictors were weaker (length-only: β = .85, 95% CI [−1.23, 2.94]; direct-action: β = .08, 95% CI [−2.10, 2.28]). Fig. 4a shows helpfulness as a function of utility, with the partial regression line from the joint model holding the other predictors at their means. Overall, helpfulness reflects more than brevity: participants preferred explanations that the utility model predicts will support reliable and efficient guidance. Leave-one-component-out ablations further showed that all three utility terms contributed to predicting judgments (Fig. 4b). Removing $\textsc{Replan}$ or $\textsc{Succ}$ produced the largest losses, suggesting that people penalize explanations requiring frequent reinterpretation or yielding unreliable guidance; removing path efficiency had a smaller but reliable effect.
Figure 5: Policy vs. value content in explanations. (a) Navigation path length for explanations that contain policy guidance vs. those that do not. (b) Mean helpfulness ratings by map difficulty for explanations containing policy+value information, policy-only information, or value-only information. Error bars show standard errors.

**Linguistic strategies and structure.** We analyzed rated explanations to characterize strategies associated with higher utility. We coded whether an explanation primarily provided a *policy* (procedural actions), a *value* signal (goal- or landmark-oriented direction), or both. Policy information improved navigation efficiency: trials with policy content yielded shorter paths than trials without policy content (Fig. 5a; 25.09 ± 7.36 vs. 33.83 ± 12.52; t = 3.20, p < .005, d = .851). However, helpfulness depended on combining strategies. Explanations integrating policy and value cues were rated highest overall (policy+value: 64.23 ± 31.50; policy-only: 62.47 ± 30.41; value-only: 24.99 ± 25.80), although the difference between policy+value and policy-only was not significant (t = 1.614, p = .106). Importantly, the advantage of combining policy and value increased with map difficulty (Fig. 5b): policy-only guidance degraded on hard maps, whereas policy+value remained robust. Overall, these analyses provide converging evidence that our model's notion of communicative utility, grounded in executable guidance for a partially informed listener, captures both the *behavioral* and *linguistic* signatures of what people consider to be good explanations of how to do something.
Table 1: Failure modes of LLM translation. Counts (n), mean LLM success rate (Succ.), mean word length (Len.), and mean number of direction words (Dir.) per explanation. Categories are non-exclusive.

**Failure modes of LLM translation.** We next characterized *why* certain explanations fail. Coding explanations by length, direction-word density, plan failure rate, and lexical cues produced four non-exclusive categories (Tab. 1). *Direction overload* consists of turn-by-turn directives (“go up 3, then right 2, then down 4”) that destabilize the planner: a single misaligned step cascades. *Overcomplicated* explanations are long multi-clause instructions whose conditional structure (“if you see X, then Y, unless Z”) exceeds what the LLM compiler can reliably parse. *Overly compressed* explanations omit needed details and induce uninformed search. *Spatial ambiguity* names a region without a referable landmark (“somewhere near the middle”), forcing the planner into repeated re-queries. The hardest map (Map 10, 4.7% success) was dominated by direction overload, as long corridors amplified the cost of a single misaligned step. More broadly, the two most frequent failure modes fall at opposite ends of length and directive density, suggesting that failure reflects not simply being too terse or too verbose, but a mismatch between explanation form and actionable guidance.
Human judgments and LLM translation success can also diverge. Using a top-quartile cutoff (≥ 0.7 LLM success; ≥ 70 helpfulness), 19 explanations are *LLM-only* and 15 are *human-only*. *Human-only* cases are more than twice as long (35.1 vs. 16.8 words), use more direction words (5.1 vs. 2.5), and more often include explicit step counts (60% vs. 16%): the same surface features humans read as conscientious guidance overload the translator with brittle commitments. *LLM-only* explanations are terse, landmark-anchored statements that compile cleanly but read as under-informative. This double dissociation suggests that bridging the gap requires translators that recover intent from verbose human directives, not speakers writing for the machine.
## Discussion
Procedural explanations are designed to *produce competent action* in a listener who lacks full information, not merely to describe the world. In a partially observable navigation task, explanations ranked highly by our utility model received higher helpfulness ratings and yielded shorter paths than no-explanation baselines, with monotonic gains from Bad to Medium to Good. This supports a behavioral criterion for explanation quality: an explanation is good insofar as it grounds into *executable guidance* that supports planning under uncertainty. Our framework builds on the RSA intuition that utterances maximize communicative utility, but shifts the target from information gain to downstream behavior: we evaluate an explanation by simulating a partially informed agent that stochastically translates language into guidance for planning and navigation. This predicts that models capturing the translation-and-planning bottleneck should match human judgments better than surface heuristics or planning-free baselines, which our results support. The full model outperformed both a length-only heuristic (so participants do not simply prefer shorter explanations) and a direct-action baseline that skips program-like guidance (so modeling *how* explanations shape sequences of decisions matters).
Our linguistic analyses clarify what executable guidance looks like. Purely procedural, step-by-step explanations can improve efficiency when they remain aligned with the listener's local context, but are brittle under partial observability. By contrast, value- or landmark-oriented information helps the listener re-localize and replan when things go awry. Consistent with this account, explanations integrating procedural steps with goal structure were rated highest overall and yielded larger benefits with increasing map difficulty. This supports an action-under-uncertainty view of instruction giving: good explanations balance specificity (to reduce deliberation) with stable reference points (to avoid replanning).
Several limitations remain. First, we used simplified grid worlds to provide a controlled testbed. However, real-world instructions involve hierarchical goals [7] and interactive replanning [24]. Second, our listener model is largely static. Aside from partial observability, its physical capabilities (what actions are feasible) and knowledge (what landmarks or conventions are shared) do not change. In natural settings, effective explainers tailor guidance by inferring what a listener can do and what they already know [25]. A natural extension is to model listener knowledge as a latent state inferred online from behavior [26], enabling explanations that adapt to individual capabilities in more open-ended environments.
Overall, the results support a view of procedural explanation as utility-guided communication grounded in action: explanations are effective when they reliably translate into plans that work for a partially informed listener, minimize replanning, and support efficient goal achievement.
## Acknowledgments
We thank Robert Hawkins and Lio Wong for helpful discussions. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting HZ. HZ and CMW are supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (C⁴: 101164709), the Hessian research funding programme LOEWE/4b//519/05/01.002(0022)/119, the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) under Germany's Excellence Strategy (EXC 3066/1 “The Adaptive Mind”, Project No. 533717223), and the Excellence Cluster “Reasonable AI” by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) under Germany's Excellence Strategy – EXC-3057. TG was supported by grants from the Stanford Institute for Human-Centered Artificial Intelligence (HAI) and from the Cooperative AI Foundation. BB was supported by the Studienstiftung des deutschen Volkes (German Academic Scholarship Foundation) for her research stay at Stanford.
## References
- [1] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
- [2] (2019). Learning how to generalize. Cognitive Science, 43(8), e12777.
- [3] A. Baltaretu, E. Krahmer, and A. Maes (2015). Improving route directions: The role of intersection type and visual clutter for spatial reference. Applied Cognitive Psychology, 29(5), 647–660.
- [4] K. Chandra, T. Chen, T. Li, J. Ragan-Kelley, and J. Tenenbaum (2024). Cooperative explanation as rational communication. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 46.
- [5] A. M. Chen, A. Palacci, N. Vélez, R. D. Hawkins, and S. J. Gershman (2024). A hierarchical Bayesian model of adaptive teaching. Cognitive Science, 48(7), e13477.
- [6] D. Chen and R. Mooney (2011). Learning to interpret natural language navigation instructions from observations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 25, pp. 859–865.
- [7] C. Colas, T. Mills, B. Prystawski, M. H. Tessler, N. Goodman, J. Andreas, and J. Tenenbaum (2025). Language and experience: A computational model of social learning in complex tasks. arXiv preprint arXiv:2509.00074.
- [8] J. Degen (2023). The Rational Speech Act framework. Annual Review of Linguistics, 9, 519–540.
- [9] M. Denis and J. M. Loomis (2007). Perspectives on human spatial cognition: Memory, navigation, and environmental learning. Psychological Research, 71(3), 235–239.
- [10] M. C. Frank and N. D. Goodman (2012). Predicting pragmatic reasoning in language games. Science, 336(6084), 998.
- [11] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell (2018). Speaker-follower models for vision-and-language navigation. Advances in Neural Information Processing Systems, 31.
- [12] N. D. Goodman and M. C. Frank (2016). Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11), 818–829.
- [13] H. P. Grice (1975). Logic and conversation. In Speech Acts, pp. 41–58.
- [14] J. Harding, T. Gerstenberg, and T. Icard (2025). A communication-first account of explanation. arXiv preprint. [https://arxiv.org/abs/2505.03732](https://arxiv.org/abs/2505.03732)
- [15] B. Joshi, Z. Liu, S. Ramnath, A. Chan, Z. Tong, S. Nie, Q. Wang, Y. Choi, and X. Ren (2023). Are machine rationales (not) useful to humans? Measuring and improving human utility of free-text rationales. arXiv preprint arXiv:2305.07095.
- [16] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2), 99–134.
- [17] L. Kirfel, T. Icard, and T. Gerstenberg (2022). Inference from explanation. Journal of Experimental Psychology: General, 151(7), 1481.
- [18] T. Lombrozo (2009). Explanation and categorization: How “why?” informs “what?”. Cognition, 110(2), 248–253.
- [19] T. Lombrozo (2012). Explanation and abductive inference. In The Oxford Handbook of Thinking and Reasoning, pp. 260.
- [20] S. Saha, P. Hase, and M. Bansal (2023). Can language models teach weaker agents? Teacher explanations improve students via theory of mind. arXiv preprint arXiv:2306.09299.
- [21] A. Schwering, J. Krukar, R. Li, V. J. Anacta, and S. Fuest (2017). Wayfinding through orientation. Spatial Cognition & Computation, 17(4), 273–303.
- [22] P. Shafto, N. D. Goodman, and T. L. Griffiths (2014). A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71, 55–89.
- [23] T. R. Sumers, M. K. Ho, T. L. Griffiths, and R. D. Hawkins (2024). Reconciling truthfulness and relevance as epistemic and decision-theoretic utility. Psychological Review, 131(1), 194.
- [24] N. Tomlin, N. Zhou, E. Fleisig, L. Chen, T. Wright, L. Vinh, L. X. Ma, S. Eisape, E. French, T. Du, et al. (2025). Characterizing language use in a collaborative situated game. arXiv preprint arXiv:2512.03381.
- [25] N. Vélez, A. M. Chen, T. Burke, F. A. Cushman, and S. J. Gershman (2023). Teachers recruit mentalizing regions to represent learners' beliefs. Proceedings of the National Academy of Sciences, 120(22), e2215015120.
- [26] H. Zhou, R. Bamler, C. M. Wu, and A. Tejero-Cantero (2024). Predictive, scalable and interpretable knowledge tracing on structured domains. In The Twelfth International Conference on Learning Representations. [https://dx.doi.org/10.48550/arXiv.2403.13179](https://dx.doi.org/10.48550/arXiv.2403.13179)