Quality Without Usefulness: LLM-Generated XAI Narratives as Trust Heuristics Rather Than Decision Aids

arXiv cs.CL Papers

Summary

This paper investigates whether high-quality Natural Language Explanations (NLEs) generated by LLMs from XAI outputs actually improve task performance, finding they do not aid accuracy but inflate confidence, revealing a quality-usefulness gap.

arXiv:2605.26770v1 Announce Type: new Abstract: Prior work shows that Large Language Models (LLMs) can transform Explainable AI (XAI) outputs into Natural Language Explanations (NLEs) that score highly on quality metrics such as plausibility, coherence, and comprehensibility. But does explanation quality translate to practical usefulness? We investigate this question in a time-series energy forecasting domain through five controlled experiments (2,730 judgments across 60 test instances), each operationalising a distinct facet of usefulness studied in the XAI literature. Holding NLE quality constant at the high levels established by a prior factorial study, we find that NLEs do not improve task accuracy on any of the five tasks, while inflating self-reported confidence. A placebic control shows that this confidence boost is driven by text presence rather than content. In an out-of-distribution detection task, NLEs reduce the LLM judge's ability to flag unreliable predictions, providing false reassurance that masks model failure. We characterise these findings as the Quality-Usefulness Gap and argue that evaluation of the XAI-to-NLE pipeline must extend beyond text-quality metrics to downstream task performance.

Cached at: 05/27/26, 09:10 AM

# Quality Without Usefulness: LLM-Generated XAI Narratives as Trust Heuristics Rather Than Decision Aids
Source: [https://arxiv.org/html/2605.26770](https://arxiv.org/html/2605.26770)
Fabian Lukassen¹, Jan Herrmann², Christoph Weisser³, Alexander Silbersdorff¹, Benjamin Saefken⁴, Thomas Kneib¹

¹University of Göttingen, ²BASF SE, ³Hochschule Bielefeld, ⁴TU Clausthal

fabian.lukassen@stud.uni-goettingen.de, jan.herrmann@basf.com, christoph.weisser@hsbi.de, asilbersdorff@uni-goettingen.de, benjamin.saefken@tu-clausthal.de, tkneib@uni-goettingen.de

###### Abstract

Prior work shows that Large Language Models (LLMs) can transform Explainable AI (XAI) outputs into Natural Language Explanations (NLEs) that score highly on quality metrics such as plausibility, coherence, and comprehensibility. But does explanation *quality* translate to practical *usefulness*? We investigate this question in a time-series energy forecasting domain through five controlled experiments (2,730 judgments across 60 test instances), each operationalising a distinct facet of usefulness studied in the XAI literature. Holding NLE quality constant at the high levels established by a prior factorial study, we find that NLEs do not improve task accuracy on any of the five tasks, while inflating self-reported confidence. A placebic control shows that this confidence boost is driven by text *presence* rather than content. In an out-of-distribution detection task, NLEs reduce the LLM judge’s ability to flag unreliable predictions, providing false reassurance that masks model failure. We characterise these findings as the *Quality-Usefulness Gap* and argue that evaluation of the XAI-to-NLE pipeline must extend beyond text-quality metrics to downstream task performance.¹

¹ [Code and data URL](https://github.com/fabian-lu/quality-usefulness-gap).


## 1 Introduction

Post-hoc XAI methods such as SHAP (Lundberg and Lee, [2017](https://arxiv.org/html/2605.26770#bib.bib37)) produce feature-attribution outputs that presuppose a statistical literacy that their intended audiences – domain experts, decision makers, regulators – typically lack (Arrieta et al., [2020](https://arxiv.org/html/2605.26770#bib.bib3); Miller, [2019](https://arxiv.org/html/2605.26770#bib.bib39)). Natural Language Explanations (NLEs) generated by Large Language Models (LLMs) translate those outputs into prose accessible to non-experts. A growing body of work now builds such NLE-from-XAI pipelines across tabular (Martens et al., [2025](https://arxiv.org/html/2605.26770#bib.bib38); Zytek et al., [2024b](https://arxiv.org/html/2605.26770#bib.bib57); Dwiyanti et al., [2025](https://arxiv.org/html/2605.26770#bib.bib16); Swamy et al., [2025](https://arxiv.org/html/2605.26770#bib.bib52)), graph (Cedro and Martens, [2025](https://arxiv.org/html/2605.26770#bib.bib11)), and time-series (Aksu et al., [2024](https://arxiv.org/html/2605.26770#bib.bib2)) settings, all reporting consistently high quality scores. A recent factorial study (Lukassen et al., [2025](https://arxiv.org/html/2605.26770#bib.bib35)) confirms this at scale: across 4 ML models, 3 XAI conditions, 3 LLMs, and 8 prompting strategies (660 NLEs), G-Eval scores reach 4.0–4.8/5. But quality there was measured on text-artefact properties only: plausibility, coherence, comprehensibility.

![Refer to caption](https://arxiv.org/html/2605.26770v1/hook_v2c_trimmed.png)

Figure 1: An LLM narrates a prediction and XAI attributions into a high-quality NLE. We test whether such NLEs help downstream decisions.

Quality, however, is not utility. Research on the visual XAI outputs that these NLE pipelines build on – feature-importance charts, saliency maps, attribution plots – has repeatedly shown that they often fail to improve, and sometimes harm, downstream decision-making (Bansal et al., [2021](https://arxiv.org/html/2605.26770#bib.bib5); Jesus et al., [2021](https://arxiv.org/html/2605.26770#bib.bib29); Schemmer et al., [2022](https://arxiv.org/html/2605.26770#bib.bib46); Buçinca et al., [2021](https://arxiv.org/html/2605.26770#bib.bib9)). That work draws a distinction between explanation *quality* (text-artefact properties; Nauta et al., [2023](https://arxiv.org/html/2605.26770#bib.bib40)) and *usefulness* (measurable impact on downstream decisions – simulatability, task performance, trust calibration; Doshi-Velez and Kim, [2017](https://arxiv.org/html/2605.26770#bib.bib15); Jacovi et al., [2021](https://arxiv.org/html/2605.26770#bib.bib28)). This distinction has not been carried over to LLM-generated NLEs.

We test it. Holding NLE quality constant at the levels established by Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)), we run five controlled experiments, each covering a distinct facet of usefulness studied in the XAI literature, on a household energy forecasting task. Across 2,730 LLM-judge judgments, NLEs fail to improve accuracy on any of the five tasks; in the most consequential setting – detecting out-of-distribution inputs – they appear to *reduce* the judge’s ability to flag unreliable predictions. We characterise this pattern as the *Quality-Usefulness Gap*.

#### Contributions:

(1) Empirical evidence for the Quality-Usefulness Gap in LLM-generated NLEs from XAI outputs. (2) Diagnosis of two false-reassurance mechanisms behind that pattern: *confidence inflation* driven by text presence rather than content (Experiments E1, E2), and *rationalisation* of anomalous inputs (E5). (3) Together, these motivate a broader argument that NLE evaluation must move beyond quality metrics to task-based usefulness measures, just as standard XAI evaluation has done.

## 2 Related Work

#### Post\-hoc XAI\.

SHAP (Lundberg and Lee, [2017](https://arxiv.org/html/2605.26770#bib.bib37)), the attribution method used throughout this paper, grounds per-feature importances in Shapley values from cooperative game theory: for prediction $f(x)$ with feature set $\mathcal{F}$, each feature receives

$$
\phi_i = \sum_{S \subseteq \mathcal{F} \setminus \{i\}} \frac{|S|!\,\bigl(|\mathcal{F}| - |S| - 1\bigr)!}{|\mathcal{F}|!}\,\bigl[f(x_{S \cup \{i\}}) - f(x_S)\bigr], \tag{1}
$$

satisfying $f(x) = \phi_0 + \sum_i \phi_i$; TreeSHAP (Lundberg et al., [2020](https://arxiv.org/html/2605.26770#bib.bib36)) computes this exactly in polynomial time for tree models. SHAP is widely adopted (Guidotti et al., [2018](https://arxiv.org/html/2605.26770#bib.bib22)) but requires statistical fluency to interpret, particularly in time-series settings with autoregressive lags (Theissler et al., [2022](https://arxiv.org/html/2605.26770#bib.bib53); Zytek et al., [2024a](https://arxiv.org/html/2605.26770#bib.bib56)).
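To make Eq. (1) concrete, the following is a minimal sketch that enumerates every coalition for a three-feature toy model and checks the additivity property $f(x) = \phi_0 + \sum_i \phi_i$. The toy `model`, the zero `baseline`, and the convention that features outside $S$ are replaced by baseline values are assumptions chosen for illustration; this brute-force enumeration is exponential in the number of features and is not the TreeSHAP algorithm used in the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values by enumerating every coalition S of the other features."""
    phi = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (|F| - |S| - 1)! / |F|! from Eq. (1)
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Toy setup (assumption): one additive term and one interaction term.
x = [3.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]


def model(z):
    return 2.0 * z[0] + z[1] * z[2]


def value(subset):
    # f(x_S): features outside S are replaced by their baseline value.
    z = [x[j] if j in subset else baseline[j] for j in range(len(x))]
    return model(z)


phi = shapley_values(value, len(x))
phi_0 = value(frozenset())
print(phi, phi_0 + sum(phi), model(x))
```

In this toy case the additive feature receives its full contribution and the two interacting features split the interaction term equally, which is the kind of attribution an NLE then has to narrate.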

#### Quality without utility in standard XAI\.

A growing literature finds that visual XAI outputs frequently do not improve decision-making and sometimes harm it. Bansal et al. ([2021](https://arxiv.org/html/2605.26770#bib.bib5)) report that explanations increase acceptance of incorrect AI recommendations without lifting team performance; Jesus et al. ([2021](https://arxiv.org/html/2605.26770#bib.bib29)) find LIME/SHAP lead to lower accuracy than raw data alone; Im et al. ([2023](https://arxiv.org/html/2605.26770#bib.bib27)) extend this to oracle saliency; Schemmer et al. ([2022](https://arxiv.org/html/2605.26770#bib.bib46)) show that explanations do not produce appropriate reliance. The mechanism is overreliance (Buçinca et al., [2021](https://arxiv.org/html/2605.26770#bib.bib9); Buçinca et al., [2020](https://arxiv.org/html/2605.26770#bib.bib8); Chen et al., [2023](https://arxiv.org/html/2605.26770#bib.bib13)). In the placebic-explanation thread, Shymanski et al. ([2025a](https://arxiv.org/html/2605.26770#bib.bib47), [b](https://arxiv.org/html/2605.26770#bib.bib48)) show that users rate placebic and actionable explanations as equally satisfying; Ajwani et al. ([2024](https://arxiv.org/html/2605.26770#bib.bib1)) and Fan et al. ([2026](https://arxiv.org/html/2605.26770#bib.bib17)) document how LLM explanations sustain trust in incorrect outputs through fluency and framing; Spillner et al. ([2026](https://arxiv.org/html/2605.26770#bib.bib50)) show that self-reported trust dissociates from behavioural reliance. The constructs *quality* (Nauta et al., [2023](https://arxiv.org/html/2605.26770#bib.bib40); Naveed et al., [2024](https://arxiv.org/html/2605.26770#bib.bib41)) and *usefulness* (Doshi-Velez and Kim, [2017](https://arxiv.org/html/2605.26770#bib.bib15); Jacovi et al., [2021](https://arxiv.org/html/2605.26770#bib.bib28)) are explicitly distinguished but rarely jointly measured.

#### LLM\-generated NLEs from XAI outputs\.

Recent systems convert SHAP attributions into prose: XAIstories (Martens et al., [2025](https://arxiv.org/html/2605.26770#bib.bib38)), Explingo (Zytek et al., [2024b](https://arxiv.org/html/2605.26770#bib.bib57)), ContextualSHAP (Dwiyanti et al., [2025](https://arxiv.org/html/2605.26770#bib.bib16)), GraphXAIN for graph models (Cedro and Martens, [2025](https://arxiv.org/html/2605.26770#bib.bib11)), the social-science-grounded iLLuMinaTE (Swamy et al., [2025](https://arxiv.org/html/2605.26770#bib.bib52)), and the time-series-specific XForecast (Aksu et al., [2024](https://arxiv.org/html/2605.26770#bib.bib2)). All evaluate exclusively on quality proxies: subjective ratings, comprehension surveys, automated grading. Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)) follow the same convention, reaching 4–5/5 quality across a factorial design. Whether high NLE quality translates to downstream usefulness has not been tested.

#### LLM\-as\-judge\.

LLM judges are now standard for open-ended generation evaluation where BLEU/ROUGE fail (Liu et al., [2023](https://arxiv.org/html/2605.26770#bib.bib34); Zheng et al., [2023](https://arxiv.org/html/2605.26770#bib.bib55); Gu et al., [2025](https://arxiv.org/html/2605.26770#bib.bib20)), despite documented systematic biases: position bias, verbosity bias, and self-preference (a judge favouring outputs from its own model family) (Zheng et al., [2023](https://arxiv.org/html/2605.26770#bib.bib55); Gu et al., [2024](https://arxiv.org/html/2605.26770#bib.bib21)). Liu et al. ([2023](https://arxiv.org/html/2605.26770#bib.bib34)) introduced G-Eval, whose chain-of-thought protocol we adapt below. In the XAI domain, Bona et al. ([2024](https://arxiv.org/html/2605.26770#bib.bib7)) show that LLM judges replicate human conclusions on coarse-grained qualities but are weaker on numerical verification.

#### Positioning\.

We ask whether the quality-versus-usefulness gap documented for visual XAI also holds for LLM-generated NLEs. Holding NLE quality constant at the levels established in Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)), we test five distinct usefulness constructs under controlled conditions.

## 3 Methodology

We build on the prediction-to-NLE pipeline of Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)), fixed here to the configuration that study identified as highest-performing and most efficient: XGBoost + SHAP TreeExplainer + zero-shot prompting, with GPT-4o and DeepSeek-R1 as generators and judges. Holding quality constant at the levels established there makes NLE usefulness the sole variable of interest.

**Algorithm 1** NLE Generation and Evaluation

1: **Input:** test set $\mathcal{D}_{\text{test}}$ ($N = 60$), XGBoost $f$, SHAP explainer $\chi$

2: **Input:** generators $\mathcal{G} = \{\text{GPT-4o, DeepSeek-R1}\}$

3: **Input:** judges $\mathcal{J} = \{\text{GPT-4o, DeepSeek-R1}\}$

4: **Output:** judgment corpus $\mathcal{R}$

5: // Phase 1: NLE Generation (shared across experiments)

6: **for all** instances $\mathbf{x}_t \in \mathcal{D}_{\text{test}}$ **do**

7: $\hat{y}_t \leftarrow f(\mathbf{x}_t)$; $\boldsymbol{\phi}_t \leftarrow \chi(f, \mathbf{x}_t)$

8: **for all** generators $g \in \mathcal{G}$ **do**

9: $E_t^g \leftarrow \text{ZeroShot}(g;\, \mathbf{x}_t, \hat{y}_t, \boldsymbol{\phi}_t)$

10: **end for**

11: **end for**

12: // Phase 2: Evaluation (per experiment)

13: **for** experiments $e$, conditions $c \in \mathcal{C}_e$, instances $\mathbf{x}_t$, judges $j \in \mathcal{J}$ **do**

14: Store $\text{Judge}(j;\, \text{BuildPrompt}(\mathbf{x}_t, c, E_t))$ in $\mathcal{R}$

15: **end for**

16: **return** $\mathcal{R}$

Algorithm [1](https://arxiv.org/html/2605.26770#alg1) gives an overview of the two-phase design shared by all five experiments. Phase 1 generates 120 NLEs once (60 instances $\times$ 2 generators), reused across experiments. Phase 2 then runs five downstream-task experiments, each ablating a subset of $\{$features, SHAP, metrics, NLE$\}$ – the condition set $\mathcal{C}_e$ – to isolate the marginal contribution of the NLE; Figure [3](https://arxiv.org/html/2605.26770#S3.F3) visualises this framework, and the individual tasks and conditions $\mathcal{C}_e$ are introduced in §[4](https://arxiv.org/html/2605.26770#S4)–§[8](https://arxiv.org/html/2605.26770#S8). The corpus $\mathcal{R}$ contains 2,730 judgments analysed via the mixed-effects models of §[3.3](https://arxiv.org/html/2605.26770#S3.SS3).

![Refer to caption](https://arxiv.org/html/2605.26770v1/x1.png)

Figure 2: XGBoost one-step-ahead predictions vs. actual weekly consumption (kWh). Grey: training period; coloured: test set.

[Figure 3 diagram. Left panel, LLM judge input: Features (always present; e.g. lag_1: 166.27, lag_2: 162.28, lag_3: 189.77, …, lag_7: 129.44, weekofyear: 42, holidays: 0); Prediction (always present; e.g. 168.76 kWh); X (SHAP; e.g. lag_6: −14.26, lag_5: −9.74, lag_7: +6.14, …); T (Metrics; R² = 0.686, MAE = 20.55, RMSE = 25.04); E (NLE; e.g. “Key Influences: The prediction (168.76 kWh) is below the average baseline (177.36 kWh) due to lower past consumption. The largest downward impact came from energy use 6 weeks prior…”). LLM judges: GPT-4o, DeepSeek-R1. Right panel, tasks: E1 Closeness (error bucket? small / med / large / very large); E2 Placebic (error bucket? small / med / large / very large); E3 Counterfactual (direction after perturbation? UP / DOWN / SIMILAR); E4 Mental Model (error bucket after 5 training examples? small / med / large / very large); E5 Selective Reliance (reliable or unreliable?). All experiments also ask: How confident are you? (1–5).]

Figure 3: Experimental framework. Left: the LLM judge receives up to five information pieces per instance – features and prediction are always present; SHAP values (X), model metrics (T), and the NLE (E) vary by condition. Right: the five downstream tasks with their response categories and confidence rating (1–5).

### 3.1 Dataset and Prediction Model

We use the UCI Individual Household Electric Power Consumption dataset (Hébrail and Bérard, [2012](https://arxiv.org/html/2605.26770#bib.bib25)): 2,075,259 minute-level measurements from a household near Paris (Dec 2006–Nov 2010), resampled to weekly granularity. Following Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)) we engineer nine features: seven autoregressive lags (lag_1–lag_7, prior weeks’ consumption in kWh), ISO week number, and French public-holiday count per week. After dropping rows with NaN lag values the dataset contains 200 weekly observations; a chronological 70/30 split yields 140 training and 60 test instances (full preprocessing in Appendix [A](https://arxiv.org/html/2605.26770#A1)). The domain was chosen for interpretability: consumption is governed by mechanisms (winter heating, holidays, weekly routines) that non-technical readers can reason about from world knowledge alone – a transparent setting in which, if NLEs cannot help, the prospects elsewhere are limited.
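The lag construction and chronological split described above can be sketched as follows. This is an illustration under stated assumptions, not the preprocessing of Appendix A: the series is a synthetic placeholder rather than the UCI data, its length of 207 weeks is chosen only so that 200 rows remain after the first seven weeks are dropped, the helper names are hypothetical, and the week-number and holiday features are omitted.

```python
def make_lag_rows(series, n_lags=7):
    """Build (lag features, target) rows; weeks without a full lag window are dropped."""
    rows = []
    for t in range(n_lags, len(series)):
        lags = {f"lag_{k}": series[t - k] for k in range(1, n_lags + 1)}
        rows.append((lags, series[t]))
    return rows


def chronological_split(rows, train_fraction=0.7):
    """Split without shuffling, so every test week comes after every training week."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Synthetic placeholder series (assumption): 207 weekly values in kWh.
weekly_kwh = [150.0 + (week % 52) for week in range(207)]

rows = make_lag_rows(weekly_kwh)
train, test = chronological_split(rows)
print(len(rows), len(train), len(test))
```

Keeping the split chronological matters for the one-step-ahead setting: a shuffled split would let lagged values from test weeks leak into training rows.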

XGBoost (Chen and Guestrin, [2016](https://arxiv.org/html/2605.26770#bib.bib12)) (the model $f$ of Algorithm [1](https://arxiv.org/html/2605.26770#alg1)) achieves $R^2 = 0.69$, MAE $= 20.55$ kWh, RMSE $= 25.04$ kWh on the test set (Figure [2](https://arxiv.org/html/2605.26770#S3.F2); absolute percentage errors span 0.3–80.2%). All predictions are one-step-ahead (true preceding lags as input, not recursive outputs). For each test instance $\mathbf{x}_t$ we obtain a prediction $\hat{y}_t = f(\mathbf{x}_t)$ and SHAP TreeExplainer attributions (Lundberg and Lee, [2017](https://arxiv.org/html/2605.26770#bib.bib37); Lundberg et al., [2020](https://arxiv.org/html/2605.26770#bib.bib36)) $\boldsymbol{\phi}_t = \chi(f, \mathbf{x}_t)$, decomposing $\hat{y}_t$ into additive per-feature contributions relative to a base value. The full hyperparameter configuration, training procedure, and SHAP computation code are provided in Appendix [A](https://arxiv.org/html/2605.26770#A1).

### 3.2 NLE Generation and Evaluation

#### Fixed pipeline\.

Each factor is justified empirically by Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)): XGBoost yielded the highest NLE quality ($d = 0.48$–$0.82$ over alternatives); SHAP and LIME produced equivalent quality ($\omega^2 = 0.02$), and TreeExplainer additionally offers exact Shapley values; zero-shot achieved near-equivalent quality to self-consistency (4.36 vs. 4.50/5) at $7\times$ lower token cost; GPT-4o (4.29/5) and DeepSeek-R1 (4.66/5) were the two highest-quality generators (Llama-3-8B was excluded for low quality and inconsistent formatting). GPT-4o is accessed via Azure OpenAI (deployment `gpt-4o`, API version `2024-10-21`); DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2605.26770#bib.bib23)) via OpenRouter (`deepseek/deepseek-r1`). The two span dense vs. Mixture-of-Experts architectures and RLHF vs. reasoning-focused RL training paradigms.

#### Generation\.

For each test instance $\mathbf{x}_t$ and each generator $g \in \mathcal{G}$, we obtain one NLE $E_t^g$ at sampling temperature $\tau = 1.0$, yielding 120 NLEs ($60 \times 2$). The generation prompt is taken verbatim from Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)): a system prompt plus a human message supplying domain context, performance metrics, the prediction, raw features, and SHAP values sorted by absolute magnitude. Output is capped at six bullets and 200 words (controlling verbosity bias; Gu et al., [2024](https://arxiv.org/html/2605.26770#bib.bib21)); full prompt, worked instance, and example output in Appendix [B](https://arxiv.org/html/2605.26770#A2). The same 120 NLEs are reused across E1–E4; E5 uses separately generated NLEs over poisoned inputs.

#### Judge design\.

Each of the five experiments is a downstream decision-making study in which an LLM judge, standing in for the non-expert reader the NLE pipeline ultimately targets, is shown the XGBoost prediction together with a varying subset of $\{$raw features, SHAP attributions, performance metrics, NLE$\}$ (the condition $\mathcal{C}_e$) and asked to perform an experiment-specific task (e.g., classify the prediction’s percentage-error bucket); it also reports a 1–5 Likert confidence. The two LLMs that produced the NLEs (GPT-4o, DeepSeek-R1) also act as judges, $\mathcal{J} = \mathcal{G}$ in Algorithm [1](https://arxiv.org/html/2605.26770#alg1). NLE conditions therefore yield four generator–judge combinations per instance, whereas no-NLE conditions yield two (judges only, since there is no NLE and hence no generator dimension). Cross-family combinations cancel self-preference bias (Zheng et al., [2023](https://arxiv.org/html/2605.26770#bib.bib55); Gu et al., [2024](https://arxiv.org/html/2605.26770#bib.bib21)), and judge identity is carried as a fixed effect in every statistical model. Judge prompts are G-Eval-style with chain-of-thought evaluation steps, run at sampling temperature $\tau = 0.0$ for deterministic evaluation; the structured-JSON response carries the answer, the confidence rating, and a one-sentence reasoning. The exact prompt for each experiment is reproduced verbatim in Appendix [C](https://arxiv.org/html/2605.26770#A3), alongside a worked example of a complete judge human message for one instance. Whether the NLE helps is then answered statistically: for each experiment we compare task accuracy across information conditions via the mixed-effects models of §[3.3](https://arxiv.org/html/2605.26770#S3.SS3). Figure [3](https://arxiv.org/html/2605.26770#S3.F3) summarises the framework: the left panel maps each condition to the information pieces in the judge’s human message (features + prediction always present; SHAP, metrics, and NLE toggled), and the right panel lists the five tasks.

### 3.3 Statistical Analysis

Within each experiment, the same instances appear under every condition \(within\-instance repeated measures\), so judgments on the same instance are not independent\. We therefore use mixed\-effects models throughout, with a random intercept per instance to absorb the residual variance shared by judgments on the same item\.

#### Accuracy\.

Binary correctness $y_{ij} \in \{0, 1\}$ for judgment $i$ on instance $j$ is modelled with a generalised linear mixed model (GLMM) using a binomial family and logit link, fitted by maximum likelihood via `lme4`’s `glmer` (Bates et al., [2015](https://arxiv.org/html/2605.26770#bib.bib6)):

$$
\text{logit}\bigl(P(y_{ij} = 1)\bigr) = \beta_0 + \boldsymbol{\beta}_c^{\top} C_{ij} + \beta_j J_{ij} + \boldsymbol{\beta}_{c \times j}^{\top} (C \times J)_{ij} + u_j, \tag{2}
$$

where $C_{ij}$ is a dummy-coded vector of condition indicators, $J_{ij}$ the judge indicator (GPT-4o vs. DeepSeek-R1), $(C \times J)_{ij}$ their interaction (carried whenever the model converges), and $u_j \sim \mathcal{N}(0, \sigma_u^2)$ the random intercept for instance $j$. The omnibus condition effect is tested via the likelihood-ratio test (LRT) between ([2](https://arxiv.org/html/2605.26770#S3.E2)) and a reduced model that drops both $C$ and $C \times J$. Pairwise contrasts are extracted as Wald $z$-tests on estimated marginal means (EMMs) with Holm-Bonferroni correction (Lenth, [2024](https://arxiv.org/html/2605.26770#bib.bib33); Holm, [1979](https://arxiv.org/html/2605.26770#bib.bib26)). The primary effect size is the odds ratio $\text{OR} = \exp(\beta_c)$: $\text{OR} = 1$ indicates no effect, $\text{OR} > 1$ multiplies the odds of a correct judgment by that factor, and the 95% confidence interval (CI) indicates precision.
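As a small illustration of two reporting steps named above, the sketch below converts a log-odds coefficient into an odds ratio and applies the Holm step-down adjustment to a set of pairwise p-values. The paper performs these steps in R; this pure-Python version, its function names, and its input values are assumptions for illustration and are not results from the study.

```python
import math


def odds_ratio(beta):
    """Odds ratio implied by a log-odds coefficient, OR = exp(beta)."""
    return math.exp(beta)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is scaled by (m - rank),
        # capped at 1, and forced to be non-decreasing across ranks.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # illustrative p-values only
adjusted = holm_adjust(raw)
print(adjusted, odds_ratio(0.0))
```

The running maximum is what distinguishes Holm from a plain rank-scaled Bonferroni: a contrast can never end up with a smaller adjusted p-value than one ranked ahead of it.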

#### Confidence\.

Ordinal confidence (1–5 Likert) is modelled with a cumulative link mixed model (CLMM) via the `ordinal` package (Christensen, [2019](https://arxiv.org/html/2605.26770#bib.bib14)):

$$
P(\text{conf}_{ij} \leq k) = \text{logit}^{-1}\bigl(\theta_k - \boldsymbol{\beta}_c^{\top} C_{ij} - \beta_j J_{ij} - u_j\bigr), \qquad k = 1, \ldots, 4, \tag{3}
$$

with $\theta_1 < \theta_2 < \theta_3 < \theta_4$ the thresholds separating the five Likert categories. Under the proportional-odds assumption, a positive $\beta_c$ multiplies the odds of reporting $\geq k$ rather than $< k$ by $\exp(\beta_c)$ at every threshold. To diagnose miscalibrated trust we additionally fit ([3](https://arxiv.org/html/2605.26770#S3.E3)) with a correctness $\times$ condition interaction (the *overconfidence* model): a positive interaction term indicates that confidence rises with the condition even among incorrect judgments. When the interaction model fails to converge, the additive model is reported as the primary fit and the interaction is evaluated separately via LRT.
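The proportional-odds reading of Eq. (3) can be checked numerically. The sketch below evaluates the cumulative probabilities for a baseline and a shifted linear predictor and confirms that the odds of exceeding each threshold change by the same factor $\exp(\beta_c)$. The threshold values and $\beta_c$ are illustrative assumptions, and the random intercept and judge term are folded into a single linear predictor `eta`.

```python
import math


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probs(thresholds, eta):
    """P(conf = k) for k = 1..5, from the cumulative form P(conf <= k) = inv_logit(theta_k - eta)."""
    cumulative = [inv_logit(theta - eta) for theta in thresholds] + [1.0]
    probs = [cumulative[0]]
    probs += [cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))]
    return probs


def odds_above(thresholds, eta):
    """Odds P(conf > k) / P(conf <= k) at each of the four thresholds."""
    return [(1.0 - inv_logit(theta - eta)) / inv_logit(theta - eta) for theta in thresholds]


thresholds = [-2.0, -0.5, 1.0, 2.5]  # illustrative, strictly increasing
beta_c = 0.8                         # illustrative condition effect

base = odds_above(thresholds, eta=0.0)
shifted = odds_above(thresholds, eta=beta_c)
ratios = [s / b for s, b in zip(shifted, base)]
print(ratios, math.exp(beta_c))
print(category_probs(thresholds, beta_c))
```

Because the same factor applies at every threshold, a single coefficient summarises the shift in confidence, which is what allows the condition effects in the results sections to be reported as one number per contrast.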

#### Bayesian analysis and ROPE\.

A frequentist non-rejection does not by itself establish the *absence* of an effect. For every null accuracy result we therefore fit a Bayesian analogue of ([2](https://arxiv.org/html/2605.26770#S3.E2)) via `brms` (Bürkner, [2017](https://arxiv.org/html/2605.26770#bib.bib10)) with weakly informative priors – $\mathcal{N}(0, 1.5)$ on fixed effects (Gelman et al., [2008](https://arxiv.org/html/2605.26770#bib.bib18)) and half-$t_3(0, 2.5)$ on the random-intercept standard deviation – using 4 chains of 4,000 iterations each. We report posterior odds ratios with 95% credible intervals (CrI) and the proportion of posterior mass inside a Region of Practical Equivalence (ROPE), a range of effect sizes we treat as “effectively no effect”. We fix the ROPE on the log-odds scale as $[-0.18, +0.18]$ ($\text{OR} \in [0.84, 1.20]$, roughly $\pm 5$ pp at a 40% baseline). A posterior with $\geq 95\%$ of its mass inside the ROPE supports practical equivalence to zero; lower proportions quantify how much posterior mass falls outside the region of triviality.
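The ROPE arithmetic can be reproduced in a few lines. The sketch below maps the log-odds bounds to odds ratios, shows the accuracy shift they imply at a 40% baseline, and computes the share of a set of draws that falls inside the ROPE. The `draws` list is a made-up stand-in for posterior samples (in the paper these would come from `brms`), and the helper names are hypothetical.

```python
import math

ROPE = (-0.18, 0.18)  # log-odds bounds stated in the text


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def rope_mass(draws, rope=ROPE):
    """Share of log-odds draws that fall inside the ROPE (bounds inclusive)."""
    inside = sum(rope[0] <= d <= rope[1] for d in draws)
    return inside / len(draws)


# Bounds on the odds-ratio scale.
or_bounds = tuple(math.exp(b) for b in ROPE)

# Accuracy implied at each ROPE edge when baseline accuracy is 40%.
baseline = 0.40
shifted = tuple(inv_logit(logit(baseline) + b) for b in ROPE)

# Illustrative draws only, not posterior samples from the study.
draws = [-0.30, -0.10, 0.05, 0.15, 0.40]

print([round(v, 3) for v in or_bounds])
print([round(v, 3) for v in shifted])
print(rope_mass(draws))
```

At a 40% baseline the ROPE edges correspond to accuracies of about 36% and 44%, a shift of a little over four percentage points in each direction, in line with the rounded figure quoted above.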

#### Sensitivity checks\.

For every primary model we run judge\-specific GLMMs, generator and same\-family\-bias tests \(NLE conditions only\), random\-slope specifications, and Friedman / Wilcoxon nonparametric backups\. None of these alters the primary conclusions; full results are in Appendix[D](https://arxiv.org/html/2605.26770#A4)\.

#### Reporting plan\.

Each experiment (§[4](https://arxiv.org/html/2605.26770#S4)–§[8](https://arxiv.org/html/2605.26770#S8)) follows the same template: *Design* states the task and conditions; *Results* reports descriptive accuracy and confidence, the omnibus LRT, the NLE OR with 95% CI and Holm-corrected $p$, the confidence CLMM, and (for nulls) the Bayesian posterior OR. Full coefficient tables, pairwise contrasts, calibration outputs, diagnostics, sensitivity analyses, and judge prompts are in Appendices [D](https://arxiv.org/html/2605.26770#A4) and [C](https://arxiv.org/html/2605.26770#A3); §[9](https://arxiv.org/html/2605.26770#S9) synthesises the five experiments.

### 3.4 Experiment Selection

Explanation *usefulness* is not a single construct: the XAI evaluation literature distinguishes several facets, each requiring its own operationalisation (Doshi-Velez and Kim, [2017](https://arxiv.org/html/2605.26770#bib.bib15); Hase and Bansal, [2020](https://arxiv.org/html/2605.26770#bib.bib24); Bansal et al., [2021](https://arxiv.org/html/2605.26770#bib.bib5); Nauta et al., [2023](https://arxiv.org/html/2605.26770#bib.bib40)). We chose one experiment per facet, plus one mechanism-diagnostic control. E1 (*forward simulatability*) and E3 (*counterfactual simulatability*) together cover the two simulatability operationalisations of Doshi-Velez and Kim ([2017](https://arxiv.org/html/2605.26770#bib.bib15)) and Hase and Bansal ([2020](https://arxiv.org/html/2605.26770#bib.bib24)): can the judge predict the model’s past performance, and its response to feature perturbations? E4 (*mental model transfer*) tests the learning facet (Bansal et al., [2021](https://arxiv.org/html/2605.26770#bib.bib5); Buçinca et al., [2020](https://arxiv.org/html/2605.26770#bib.bib8)): does prior NLE exposure produce understanding that transfers to a test instance without NLE? E5 (*selective reliance under distribution shift*) tests the deployment-critical question of whether the judge appropriately distrusts predictions on out-of-distribution inputs (Parasuraman and Riley, [1997](https://arxiv.org/html/2605.26770#bib.bib42); Lee and See, [2004](https://arxiv.org/html/2605.26770#bib.bib32); Buçinca et al., [2021](https://arxiv.org/html/2605.26770#bib.bib9)). E2 (*placebic content/presence control*) is not itself a usefulness measure but a mechanism diagnostic (Langer et al., [1978](https://arxiv.org/html/2605.26770#bib.bib31); Shymanski et al., [2025a](https://arxiv.org/html/2605.26770#bib.bib47)): it separates whether any NLE-induced effect comes from the NLE’s content or merely from the presence of additional text. Together, the four constructs (assessment of past performance, prediction of future behaviour, learning, selective reliance) span the dominant downstream-usefulness criteria in the XAI literature, and the placebic control isolates the mechanism behind whatever pattern they yield.

## 4 E1: Forward Simulatability

### 4.1 Design

#### Task\.

The judge classifies the XGBoost prediction’s absolute percentage error into one of four ordered buckets – small ($< 5\%$), medium ($5$–$15\%$), large ($15$–$30\%$), very large ($\geq 30\%$) – and reports a 1–5 Likert confidence.
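The ground-truth label for this task follows directly from the bucket boundaries. A minimal sketch is given below; the function name is hypothetical, and treating each bucket as a half-open interval (so that exactly 5% counts as medium and exactly 15% as large) is an assumption, since the text only fixes the upper bucket as $\geq 30\%$.

```python
def error_bucket(prediction, actual):
    """Map a prediction's absolute percentage error to one of the four E1 buckets."""
    ape = abs(prediction - actual) * 100.0 / abs(actual)
    if ape < 5.0:
        return "small"
    if ape < 15.0:
        return "medium"
    if ape < 30.0:
        return "large"
    return "very large"


# Illustrative values only (prediction, actual) in kWh.
examples = [(100.0, 104.0), (110.0, 100.0), (115.0, 100.0), (150.0, 100.0)]
for prediction, actual in examples:
    print(prediction, actual, error_bucket(prediction, actual))
```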

Table 1: E1 information conditions. All include prediction, features, model name, and domain context. The key contrast is X+T vs. E+X+T.
#### Information conditions\.

Each instance is evaluated under five conditions (Table [1](https://arxiv.org/html/2605.26770#S4.T1)). All include the prediction, raw features, model name, and domain context; the structured bundles – SHAP values (X) and performance metrics (T: MAE, RMSE, $R^2$) – are added incrementally.

Standalone NLE conditions (E, E+X, E+T) are deliberately excluded: NLEs were generated with full context and narrate SHAP and metrics in prose, so including one without the corresponding structured data would leak that information through the narrative. The only clean NLE test is therefore E+X+T vs. X+T. The four no-NLE conditions yield $4 \times 60 \times 2 = 480$ judgments; E+X+T uses the $2 \times 2$ generator–judge design for $60 \times 4 = 240$. Total $N = 720$. Ground-truth bucket distribution across instances: 12 / 29 / 16 / 3 (small/medium/large/very-large).
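The judgment counts above follow from the design and can be tallied as a quick sanity check. The sketch below uses the condition names from the text; the dictionary layout is an assumption for illustration, and the last line simply reports the share of the most common ground-truth bucket as a point of reference.

```python
N_INSTANCES = 60
N_JUDGES = 2
N_GENERATORS = 2

# Condition name -> whether the judge prompt contains an NLE.
conditions = {"Baseline": False, "X": False, "T": False, "X+T": False, "E+X+T": True}

# Without an NLE there is one judgment per (instance, judge);
# with an NLE there is one per (instance, judge, generator).
counts = {
    name: N_INSTANCES * N_JUDGES * (N_GENERATORS if has_nle else 1)
    for name, has_nle in conditions.items()
}
total = sum(counts.values())

# Ground-truth bucket distribution reported in the text.
bucket_counts = {"small": 12, "medium": 29, "large": 16, "very large": 3}
majority_share = max(bucket_counts.values()) / sum(bucket_counts.values())

print(counts, total)
print(round(majority_share, 3))
```

The unbalanced cell sizes (120 judgments per no-NLE condition against 240 for E+X+T) are one reason the analysis relies on mixed-effects models rather than simple comparisons of raw accuracy.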

### 4.2 Results

Table 2: E1 accuracy and mean confidence by condition.

#### Accuracy.

The GLMM shows no omnibus condition effect (LRT $\chi^2(4)=1.58$, $p=.81$); the key NLE contrast E+X+T vs. X+T yields $\text{OR}=0.83$ $[0.47, 1.48]$, $p=.53$: the NLE adds nothing beyond the structured data it narrates. All pairwise contrasts are non-significant after Holm correction. Bayesian posterior for E+X+T vs. Baseline: $\text{OR}=1.25$ $[0.69, 2.18]$, ROPE 34.8%, consistent with a small, non-significant effect.

#### Confidence\.

A CLMM yields a highly significant condition effect (LRT $\chi^2(4)=179.1$, $p<2.2\times 10^{-16}$). Holm-corrected pairwise contrasts show that every condition containing both structured bundles (X+T, E+X+T) differs from every condition containing at most one (Baseline, X, T) at $p<.0001$; the NLE-specific increment over X+T is not significant ($p=.62$). The correctness $\times$ condition interaction is significant: among *wrong* predictions, mean confidence rises from 3.46 (Baseline) to 3.90 (E+X+T). The full wrong/correct breakdown is in Appendix [D](https://arxiv.org/html/2605.26770#A4).

## 5 E2: Placebic Control

### 5.1 Design

The task is identical to E1 (classify error magnitude). All three conditions include features, SHAP, and metrics, and differ only in NLE source: Baseline (no NLE), Real NLE (NLE generated for this instance), and Placebo NLE (NLE generated for a *different* instance, drawn from a random derangement, i.e. a permutation with no fixed points, so each placebic NLE describes a different prediction context).
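
One simple way to draw such a derangement is rejection sampling: shuffle until no item stays in place. The sketch below is illustrative only; the function names and the seed handling are assumptions, and the paper does not say which sampling method was used.

```python
import random


def random_derangement(n: int, rng: random.Random) -> list[int]:
    """Return a permutation of range(n) with no fixed points (rejection sampling)."""
    if n < 2:
        raise ValueError("a derangement needs at least two items")
    perm = list(range(n))
    while True:
        rng.shuffle(perm)
        # Accept only if no index maps to itself.
        if all(i != target for i, target in enumerate(perm)):
            return perm


def assign_placebo_nles(nles: list[str], seed: int = 0) -> list[str]:
    """Give each instance the NLE that was generated for a different instance."""
    perm = random_derangement(len(nles), random.Random(seed))
    return [nles[j] for j in perm]
```

Because roughly one shuffle in $e \approx 2.7$ is a derangement, the loop terminates after a few attempts on average, and every derangement is equally likely.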

### 5.2 Results

Table 3: E2 accuracy and confidence by condition.

#### Accuracy.

The omnibus condition effect is null ($\chi^2(2)=0.68$, $p=.71$); Holm-corrected pairwise contrasts all yield $p=1.0$: Real vs. Baseline $\text{OR}=1.26$, Placebo vs. Baseline $\text{OR}=1.07$, and the critical Placebo vs. Real $\text{OR}=0.85$. Bayesian posteriors corroborate: Real $\text{OR}=1.24$ $[0.68, 2.28]$, Placebo $\text{OR}=1.06$ $[0.58, 1.89]$. The three conditions are indistinguishable on accuracy.
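
Adjusted $p$-values of exactly 1.0 are a natural outcome of the Holm procedure, which multiplies the smallest raw $p$-value by the number of contrasts and caps the result at 1. A minimal sketch of the step-down adjustment follows; `holm_adjust` is a hypothetical helper, not the implementation used in the analysis.

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted
```

For three contrasts, any set of raw $p$-values whose smallest member exceeds $1/3$ is adjusted to 1.0 across the board.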

#### Confidence\.

The additive CLMM (the interaction model produced near-zero random-intercept variance) yields contrast estimates of $1.10$ for Real vs. Baseline ($p=.002$), $0.68$ for Placebo vs. Baseline ($p=.051$), and $\Delta=-0.42$ for Real vs. Placebo ($p=.12$). Both NLE types shift confidence upward relative to Baseline, the placebo contrast narrowly missing the .05 threshold; the two NLE types do not differ significantly from each other.

## 6 E3: Counterfactual Simulatability

### 6.1 Design

For each test instance we perturb the three most influential features (by absolute SHAP value): lag features by $\pm 25\%$ of their value (random sign), weekofyear by $\pm 13$ weeks (clamped to $[1, 52]$), holiday\_week\_count by $+2$. The XGBoost model is re-evaluated, and the judge predicts whether the output is higher, lower, or similar ($<5\%$ change). Both conditions include features, SHAP, and metrics; they differ only in whether the NLE for the original instance is shown. Total $N=360$; ground truth: 28 higher, 24 lower, 8 similar.
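
The perturbation and labelling rules can be sketched as below. The function names are hypothetical, and two details are assumptions: that the sign of the 13-week shift is drawn at random like the lag sign, and that the 5% threshold is measured relative to the original prediction.

```python
import random

REL_TOLERANCE = 0.05  # a relative change below 5% counts as "similar"


def perturb_top_features(features: dict, shap_values: dict,
                         rng: random.Random, k: int = 3) -> dict:
    """Perturb the k features with the largest absolute SHAP value."""
    top = sorted(shap_values, key=lambda name: abs(shap_values[name]),
                 reverse=True)[:k]
    perturbed = dict(features)
    for name in top:
        sign = rng.choice([-1, 1])
        if name.startswith("lag_"):
            perturbed[name] = features[name] * (1 + 0.25 * sign)
        elif name == "weekofyear":
            # Shift by 13 weeks and clamp to [1, 52]; the random sign is assumed.
            perturbed[name] = min(52, max(1, features[name] + 13 * sign))
        elif name == "holiday_week_count":
            perturbed[name] = features[name] + 2
    return perturbed


def direction_label(original_pred: float, perturbed_pred: float) -> str:
    """Ground-truth direction of the model output after perturbation."""
    change = (perturbed_pred - original_pred) / abs(original_pred)
    if abs(change) < REL_TOLERANCE:
        return "similar"
    return "higher" if change > 0 else "lower"
```

The ground-truth label for an instance would come from calling the forecasting model on both feature dictionaries and passing the two outputs to `direction_label`.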

### 6.2 Results

Table 4: E3 direction accuracy by condition.

#### Accuracy.

Both conditions clear the 33.3% chance rate. The X vs. E+X contrast yields $\text{OR}=1.37$ $[0.79, 2.37]$, $p=.27$ (Bayesian posterior $\text{OR}=1.36$ $[0.76, 2.46]$). Per-class recall: similar is hardest (19–25%), higher and lower comparable (55–64%).

#### Confidence\.

Confidence is at ceiling ($\sim$4.0) in both conditions (CLMM $z=0.70$, $p=.49$), leaving no room for NLE-induced inflation. A significant correctness $\times$ condition interaction on calibration ($p=.019$) suggests NLEs marginally improve confidence–accuracy tracking without affecting accuracy, but overconfidence among incorrect predictions remains extreme ($\sim$71–77%).

## 7 E4: Mental Model Transfer

### 7.1 Design

A sliding-window in-context paradigm: for each test position $i\geq 5$ (55 positions), the judge first sees the five preceding instances ($i-5$ to $i-1$) as training examples with features, SHAP, metrics, prediction, true value, and correct error bucket; it then classifies the bucket of instance $i$, given features, SHAP, and prediction but no true value and *no NLE*. The two conditions differ only in whether the five training examples include NLEs; the test instance never does.
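
The window construction can be written out in a few lines. Zero-based indexing over the 60 test instances is an assumption made here so that the count of 55 positions works out; the function name is illustrative.

```python
def sliding_windows(n_instances: int = 60, window: int = 5):
    """Yield (training_indices, test_index) pairs for every position i >= window."""
    for i in range(window, n_instances):
        yield list(range(i - window, i)), i


windows = list(sliding_windows())
first_train, first_test = windows[0]
second_train, _ = windows[1]

# Share of training examples that two adjacent positions have in common.
overlap = len(set(first_train) & set(second_train)) / len(first_train)
```

Adjacent positions share four of their five training examples, which is why neighbouring judgments are not independent.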

### 7.2 Results

Table 5: E4 accuracy by condition.

#### Accuracy.

The GLMM yields a directional but non-significant gain ($\text{OR}=1.86$ $[0.75, 4.61]$, $p=.18$; Bayesian posterior $\text{OR}=1.86$ $[0.76, 4.73]$). The wide interval reflects extreme between-instance difficulty (random-intercept $\text{SD}=5.20$); trial order is null ($p=.76$), ruling out learning-over-time. A design caveat applies: the sliding window induces 80% overlap between adjacent training sets, producing temporal autocorrelation only imperfectly absorbed by the instance random intercept.

#### Confidence\.

Confidence is identical across conditions (mean 3.57, $p=.97$): when NLEs appear only in training, no confidence inflation occurs at test time.

## 8 E5: Selective Reliance

### 8.1 Design

#### Data manipulation\.

The 60 test instances form the baseline condition. Out-of-distribution (OOD) variants are constructed by season-dependent lag poisoning – cold-season weeks are pushed low, warm-season weeks high (full code in Appendix [A](https://arxiv.org/html/2605.26770#A1)). XGBoost and TreeExplainer are then re-run on the poisoned inputs: predictions and SHAP values are internally consistent with the poisoned features but systematically misrepresent real consumption. NLEs for poisoned instances are generated separately by the same pipeline (§[3.2](https://arxiv.org/html/2605.26770#S3.SS2)) – they faithfully describe the poisoned attributions, which is precisely the rationalisation mechanism under test.

#### Task\.

The judge classifies each instance as *reliable* or *unreliable* (ground truth: baseline reliable, OOD unreliable). All conditions include features, SHAP, and metrics. NLE conditions use the $2\times 2$ generator–judge design ($N\times 4$); no-NLE conditions yield $N\times 2$.

### 8.2 Results

Table 6: E5 accuracy and confidence by data type and NLE presence.

#### Manipulation check.

Baseline instances are correctly flagged reliable at 95\.0–98\.3%; OOD detection is only 15–30% overall\.

#### False reassurance\.

The factorial GLMM reveals a significant poisoning $\times$ NLE interaction (LRT $\chi^2(1)=9.84$, $p=.002$). On baseline instances NLEs marginally help (OR No NLE vs. NLE $=0.31$, $p=.08$); on OOD instances NLEs significantly *hurt* detection (OR $=3.13$, $p=.0003$), halving accuracy from 30.0% to 15.0%. Bayesian analysis: posterior interaction $\text{OR}=0.12$ $[0.04, 0.37]$ – the only effect in this study whose 95% credible interval excludes 1.

#### Confidence and calibration\.

Confidence is uniformly high across all four conditions (3.82–3.99; CLMM interaction $p=.47$). Calibration is inverted in OOD (Somers’ $D=-0.44$ no-NLE, $-0.37$ NLE): higher confidence is associated with *lower* accuracy.
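
A negative Somers' $D$ means that, among pairs of judgments with different outcomes, the incorrect one tends to carry the higher confidence. The sketch below computes one common variant, normalised over pairs that differ in correctness (equivalent to $2\cdot\text{AUC}-1$). Which variant the analysis used is not stated here, so this convention and the function name are assumptions.

```python
from itertools import combinations


def somers_d(confidence: list[int], correct: list[int]) -> float:
    """Somers' D of confidence with respect to a binary correctness outcome.

    Only pairs that differ in correctness are counted; pairs tied on
    confidence stay in the denominator but add nothing to the numerator.
    """
    concordant = discordant = comparable = 0
    for (c1, y1), (c2, y2) in combinations(zip(confidence, correct), 2):
        if y1 == y2:
            continue  # same outcome: the pair carries no ranking information
        comparable += 1
        # Positive product: the correct judgment also has the higher confidence.
        product = (c1 - c2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if comparable == 0:
        raise ValueError("need at least one correct and one incorrect judgment")
    return (concordant - discordant) / comparable
```

On a toy input where the two wrong judgments carry confidences 5 and 4 and the two right ones carry 4 and 3, three of the four mixed pairs are discordant and one is tied, giving $D=-0.75$.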

## 9 Discussion

E1–E4 yield NLE-vs.-no-NLE odds ratios between $0.83$ and $1.86$, with Bayesian credible intervals all including unity: across four facets of usefulness that the XAI literature treats as distinct, we detect no accuracy benefit from the high-quality NLEs of Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)). The one significant accuracy effect (E5) is in the opposite direction: under distribution shift, OOD detection falls from $30\%$ to $15\%$ in the NLE’s presence (interaction CrI $[0.04, 0.37]$). Confidence, in contrast, tracks NLE presence rather than accuracy: where NLEs are shown at judgment time, judges report higher confidence (E1), and E2 indicates that this elevation persists for placebic NLEs describing a different instance, with real and placebic NLEs yielding indistinguishable confidence and indistinguishable accuracy. This combination – no detectable accuracy gain, but a content-insensitive confidence shift – is the empirical pattern we read as the Quality-Usefulness Gap.

We suggest two tentative mechanisms, framed as hypotheses since we observe behaviour rather than cognition (our judges are LLMs; §[Limitations](https://arxiv.org/html/2605.26770#Sx1)). The first, *confidence inflation through peripheral processing*, plausibly accounts for E1–E2: the placebic finding – NLEs written for a different instance raising confidence comparably to faithful ones – echoes the “mindless” processing that Langer et al. ([1978](https://arxiv.org/html/2605.26770#bib.bib31)) identified for non-informative justifications and that Shymanski et al. ([2025a](https://arxiv.org/html/2605.26770#bib.bib47), [b](https://arxiv.org/html/2605.26770#bib.bib48)) revisited for AI explanations, and matches the Elaboration Likelihood Model’s peripheral-cue prediction (Petty and Cacioppo, [1986](https://arxiv.org/html/2605.26770#bib.bib43); Kahneman, [2011](https://arxiv.org/html/2605.26770#bib.bib30)): fluent text appears to act as a credibility cue, updating confidence without the engagement needed to check the NLE’s claims against the SHAP values. The second, *rationalisation of anomalous inputs*, is a candidate reading of E5, and an a priori surprising one: frontier LLMs carry broad world knowledge about energy use and seasonality, and one might have expected an LLM narrator faced with lags far outside the seasonal norm to flag the anomaly in prose alongside reporting the SHAP attributions. In E5 this did not happen – the pipeline narrated the misleading attributions as if sensible, and OOD detection deteriorated. Why the LLM did not exploit its world knowledge here – whether the system prompt is too narrow, the SHAP-centric framing crowds it out, or the capability is weaker than assumed – is an open question; an LLM narrator that reliably surfaced OOD warnings would be a natural design for a more useful pipeline.
What our results show is the present-day failure mode: faithful narration of post-hoc attributions over OOD inputs appears to dampen scepticism in our setup, consistent with the adversarial-attack literature (Ghorbani et al., [2019](https://arxiv.org/html/2605.26770#bib.bib19); Slack et al., [2020](https://arxiv.org/html/2605.26770#bib.bib49); Artelt et al., [2026](https://arxiv.org/html/2605.26770#bib.bib4)). The two readings also help interpret E3: SHAP attributions already encode the directional information a counterfactual prediction requires, so for the judge the prose restatement is largely a paraphrase. A common thread, conjectural rather than demonstrated, runs through all three: the surface properties optimised by the quality pipeline – fluency, structure, accessible language – are also those associated with peripheral credibility assessments in the standard XAI literature (Jacovi et al., [2021](https://arxiv.org/html/2605.26770#bib.bib28); Vasconcelos et al., [2023](https://arxiv.org/html/2605.26770#bib.bib54)); optimising for quality may, in this sense, be inadvertently optimising for persuasiveness.

Tentative as they are, these readings narrow the design space for what to do about the gap:

1. Don’t ship on quality alone: G-Eval-style scores remain a useful development-time check but do not predict deployment utility; task-based benchmarks should be reported alongside them.
2. Evaluate at the edge: E5 shows that an NLE system can be statistically indistinguishable from no NLE on the average input yet cause measurable harm on inputs at the edge – adversarial detection tasks belong in pre-deployment testing, not as an afterthought.
3. Decouple uncertainty from narrative: If the rationalisation reading is right, a structurally separate signal of input distributional anomaly (e.g., quantile-based bounds against the training distribution) is harder to smooth over with narrative fluency, and would arguably best be displayed *before* the NLE rather than inside it.
4. Force central processing: The peripheral-processing reading suggests the most promising interventions are not better explanations but designs that engage the reader actively: cognitive forcing functions (Buçinca et al., [2021](https://arxiv.org/html/2605.26770#bib.bib9)), devil’s-advocate framings (Suh et al., [2025](https://arxiv.org/html/2605.26770#bib.bib51)), dissenting explanations (Reingold et al., [2024](https://arxiv.org/html/2605.26770#bib.bib44)), and uncertainty-aware formats (Vasconcelos et al., [2023](https://arxiv.org/html/2605.26770#bib.bib54)) are all candidates the standard XAI literature has developed but the LLM-NLE pipeline has yet to adopt.
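
The quantile-based bounds mentioned in item 3 could look like the sketch below. It was not evaluated in this study; the 1st/99th percentile cut-offs, the function names, and the per-feature treatment are all illustrative assumptions.

```python
from statistics import quantiles


def fit_bounds(train_rows: list[dict], lower_pct: int = 1, upper_pct: int = 99) -> dict:
    """Per-feature (low, high) bounds taken from training-set percentiles."""
    bounds = {}
    for name in train_rows[0]:
        # quantiles(..., n=100) returns the 99 cut points between percentiles.
        cuts = quantiles([row[name] for row in train_rows], n=100, method="inclusive")
        bounds[name] = (cuts[lower_pct - 1], cuts[upper_pct - 1])
    return bounds


def out_of_range(instance: dict, bounds: dict) -> list[str]:
    """Names of features whose value falls outside the training bounds."""
    return [name for name, (low, high) in bounds.items()
            if not low <= instance[name] <= high]
```

A non-empty result from `out_of_range` could be shown as a plain warning ahead of the narrative, so that the signal does not depend on the NLE's wording.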

Whether any of these closes the gap in our LLM\-judge setting – let alone with human users – is an open empirical question worth answering\.

## 10 Conclusion

We tested whether the high-quality NLEs of Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)) translate to usefulness across five tasks. They do not: no significant accuracy benefit in E1–E4 (all $p\geq .18$; Bayesian CrIs include $\text{OR}=1$); E2’s placebic control localises the confidence gains to text presence, not content; and in E5 OOD-detection accuracy drops from 30% to 15% with the NLE ($p=.002$, CrI $[0.04, 0.37]$), consistent with false reassurance masking model failure. We call this the Quality-Usefulness Gap: explanation quality and usefulness can come apart, with narrative coherence plausibly acting as a trust heuristic.

## Limitations

#### LLM judges vs\. human users\.

All usefulness experiments use LLM judges rather than human participants. This enables controlled, large-scale evaluation with a uniform judge protocol, but the cognitive mechanisms we invoke (peripheral processing, automation bias, the Elaboration Likelihood Model, System 1/System 2 distinctions) are drawn from human psychology. Our findings should be read as *behavioural analogues* of the human phenomena, consistent with but not direct evidence of the underlying cognitive mechanisms. Bona et al. ([2024](https://arxiv.org/html/2605.26770#bib.bib7)) show that LLM judges track human ratings on coarse-grained qualities but are systematically less reliable on dimensions requiring deep domain knowledge or numerical verification – precisely the properties our usefulness tasks depend on. It is plausible that the Quality-Usefulness Gap is *larger* with humans, who are typically more susceptible to fluency heuristics, but human validation studies remain the necessary next step.

#### Single domain, single pipeline\.

All experiments use household electricity consumption forecasting – a low-stakes interpretable domain – with the SHAP TreeExplainer + XGBoost pipeline fixed by Lukassen et al. ([2025](https://arxiv.org/html/2605.26770#bib.bib35)). The mechanisms are not domain-specific, but generalisability to higher-stakes domains (healthcare, credit scoring, legal) is not established; the direction of the gap is unlikely to reverse, and consequences are likely to be larger. LIME-narrated NLE usefulness is not tested here (the factorial study found SHAP and LIME equivalent on quality), and all predictions are one-step-ahead – recursive forecasting would degrade performance, with potentially different NLEs.

#### Compound LLM factor\.

GPT\-4o and DeepSeek\-R1 differ on multiple axes simultaneously – architecture \(dense vs\. MoE\), parameter count, alignment strategy, access modality\. A controlled future design would vary one axis at a time, but frontier LLMs are not released in such orthogonal variants\. Our findings therefore speak to practical, market\-available options rather than to which architectural or training property is responsible\.

#### Poisoning specificity \(E5\)\.

The OOD construction pushes lag features toward seasonal extremes\. Other distribution shifts – gradual drift, covariate shift, adversarial perturbation of non\-lag features – may produce different detection patterns\.

## References

- Ajwani et al. (2024) Rohan Ajwani, Shashidhar Reddy Javaji, Frank Rudzicz, and Zining Zhu. 2024. LLM-generated black-box explanations can be adversarially helpful. *arXiv preprint arXiv:2405.06800*.
- Aksu et al. (2024) Taha Aksu, Chenghao Liu, Amrita Saha, Sarah Tan, Caiming Xiong, and Doyen Sahoo. 2024. [XForecast: Evaluating natural language explanations for time series forecasting](https://arxiv.org/abs/2410.14180). *Preprint*, arXiv:2410.14180.
- Arrieta et al. (2020) Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. 2020. [Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI](https://doi.org/10.1016/j.inffus.2019.12.012). *Information Fusion*, 58:82–115.
- Artelt et al. (2026) André Artelt, Shivam Sharma, François Lecué, and Barbara Hammer. 2026. The effect of data poisoning on counterfactual explanations. *Information Fusion*.
- Bansal et al. (2021) Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel S. Weld. 2021. [Does the whole exceed its parts? the effect of AI explanations on complementary team performance](https://doi.org/10.1145/3411764.3445717). In *CHI Conference on Human Factors in Computing Systems*.
- Bates et al. (2015) Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. [Fitting linear mixed-effects models using lme4](https://doi.org/10.18637/jss.v067.i01). *Journal of Statistical Software*, 67(1):1–48.
- Bona et al. (2024) Francesco Bombassei De Bona, Tim Miller, Gabriele Dominici, Marc Langheinrich, and Martin Gjoreski. 2024. Evaluating explanations through LLMs: Beyond traditional user studies. *arXiv preprint arXiv:2410.17781*.
- Bucinca et al. (2020) Zana Bucinca, Phoebe Lin, Krzysztof Z. Gajos, and Elena L. Glassman. 2020. [Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems](https://doi.org/10.1145/3377325.3377498). In *Proceedings of the 25th International Conference on Intelligent User Interfaces*.
- Buçinca et al. (2021) Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos. 2021. [To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making](https://doi.org/10.1145/3449287). In *Proceedings of the ACM on Human-Computer Interaction*, volume 5, pages 1–21. ACM.
- Bürkner (2017) Paul-Christian Bürkner. 2017. [brms: An R package for Bayesian multilevel models using Stan](https://doi.org/10.18637/jss.v080.i01). *Journal of Statistical Software*, 80(1):1–28.
- Cedro and Martens (2025) Mateusz Cedro and David Martens. 2025. Graphxain: Narratives to explain graph neural networks. *arXiv preprint arXiv:2411.02540*.
- Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. [XGBoost: A scalable tree boosting system](https://doi.org/10.1145/2939672.2939785). In *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, pages 785–794. ACM.
- Chen et al. (2023) Valerie Chen, Q. Vera Liao, Jennifer Wortman Vaughan, and Gagan Bansal. 2023. Understanding the role of human intuition on reliance in human-AI decision-making with explanations. *Proceedings of the ACM on Human-Computer Interaction*, 7(CSCW1).
- Christensen (2019) Rune Haubo Bojesen Christensen. 2019. [ordinal—regression models for ordinal data](https://cran.r-project.org/package=ordinal). R package version 2019.12-10.
- Doshi-Velez and Kim (2017) Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. *arXiv preprint arXiv:1702.08608*.
- Dwiyanti et al. (2025) Latifa Dwiyanti, Sergio Ryan Wibisono, and Hidetaka Nambo. 2025. Contextualshap: Enhancing SHAP explanations through contextual language generation. *arXiv preprint arXiv:2512.07178*.
- Fan et al. (2026) Shutong Fan, Lan Zhang, and Xiaoyong Yuan. 2026. When AI persuades: Adversarial explanation attacks on human trust in AI-assisted decision making. *arXiv preprint arXiv:2602.04003*.
- Gelman et al. (2008) Andrew Gelman, Aleks Jakulin, Maria Grazia Pittau, and Yu-Sung Su. 2008. [A weakly informative default prior distribution for logistic and other regression models](https://doi.org/10.1214/08-AOAS191). *The Annals of Applied Statistics*, 2(4):1360–1383.
- Ghorbani et al. (2019) Amirata Ghorbani, Abubakar Abid, and James Zou. 2019. Interpretation of neural networks is fragile. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 33, pages 3681–3688.
- Gu et al. (2025) Jiawei Gu, Xuhui Jiang, Zhichao Shi, and 1 others. 2025. A survey on LLM-as-a-judge. *arXiv preprint arXiv:2411.15594*.
- Gu et al. (2024) Jiawei Gu, Xuhui Xu, Junyi Ye, Tianyi Zhang, Ming Cheng, and Wenbo Jiao. 2024. [A survey on LLM-as-a-judge](https://arxiv.org/abs/2411.15594). *Preprint*, arXiv:2411.15594.
- Guidotti et al. (2018) Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. [A survey of methods for explaining black box models](https://doi.org/10.1145/3236009). *ACM Computing Surveys*, 51(5):1–42.
- Guo et al. (2025) Daya Guo, Dejian Yang, He Zhang, and 1 others. 2025. [DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning](https://arxiv.org/abs/2501.12948). *Preprint*, arXiv:2501.12948.
- Hase and Bansal (2020) Peter Hase and Mohit Bansal. 2020. [Evaluating explainable AI: Which algorithmic explanations help users predict model behavior?](https://doi.org/10.18653/v1/2020.acl-main.491) In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 5540–5552. Association for Computational Linguistics.
- Hébrail and Bérard (2012) Georges Hébrail and Alice Bérard. 2012. [Individual household electric power consumption data set](https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption). UCI Machine Learning Repository. Accessed: 2024.
- Holm (1979) Sture Holm. 1979. [A simple sequentially rejective multiple test procedure](https://www.jstor.org/stable/4615733). *Scandinavian Journal of Statistics*, 6(2):65–70.
- Im et al. (2023) Shawn Im, Jacob Andreas, and Yilun Zhou. 2023. Evaluating the utility of model explanations for model development. In *NeurIPS Workshop on Attributing Model Behavior at Scale*.
- Jacovi et al. (2021) Alon Jacovi, Ana Marasović, Tim Miller, and Yoav Goldberg. 2021. [Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI](https://doi.org/10.1145/3442188.3445923). In *Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency*, pages 624–635. ACM.
- Jesus et al. (2021) Sérgio Jesus, Catarina Belém, Vladimir Balayan, João Bento, Pedro Saleiro, Pedro Bizarro, and João Gama. 2021. [How can I choose an explainer? an application-grounded evaluation of post-hoc explanations](https://doi.org/10.1145/3442188.3445941). In *Conference on Fairness, Accountability, and Transparency (FAccT)*.
- Kahneman (2011) Daniel Kahneman. 2011. *Thinking, Fast and Slow*. Farrar, Straus and Giroux.
- Langer et al. (1978) Ellen J. Langer, Arthur Blank, and Benzion Chanowitz. 1978. [The mindlessness of ostensibly thoughtful action: The role of “placebic” information in interpersonal interaction](https://doi.org/10.1037/0022-3514.36.6.635). *Journal of Personality and Social Psychology*, 36(6):635–642.
- Lee and See (2004) John D. Lee and Katrina A. See. 2004. [Trust in automation: Designing for appropriate reliance](https://doi.org/10.1518/hfes.46.1.50.30392). *Human Factors*, 46(1):50–80.
- Lenth (2024) Russell V. Lenth. 2024. [emmeans: Estimated marginal means, aka least-squares means](https://cran.r-project.org/package=emmeans). R package version 1.10.0.
- Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. [G-Eval: NLG evaluation using GPT-4 with better human alignment](https://doi.org/10.18653/v1/2023.emnlp-main.153). In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 2511–2522. Association for Computational Linguistics.
- Lukassen et al. (2025) Fabian Lukassen, Jan Herrmann, Christoph Weisser, Benjamin Saefken, and Thomas Kneib. 2025. [From XAI to stories: LLM-generated natural language explanations of model predictions](https://arxiv.org/abs/2601.02224). *Preprint*, arXiv:2601.02224.
- Lundberg et al. (2020) Scott M. Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. 2020. [From local explanations to global understanding with explainable AI for trees](https://doi.org/10.1038/s42256-019-0138-9). *Nature Machine Intelligence*, 2(1):56–67.
- Lundberg and Lee (2017) Scott M. Lundberg and Su-In Lee. 2017. [A unified approach to interpreting model predictions](https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html). In *Advances in Neural Information Processing Systems*, volume 30, pages 4765–4774. Curran Associates, Inc.
- Martens et al. (2025) David Martens, Camille Dams, James Hinns, and Mark Vergouwen. 2025. [Tell me a story! narrative-driven XAI with large language models](https://doi.org/10.1016/j.dss.2024.114218). *Decision Support Systems*, 191:114218.
- Miller (2019) Tim Miller. 2019. [Explanation in artificial intelligence: Insights from the social sciences](https://doi.org/10.1016/j.artint.2018.07.007). *Artificial Intelligence*, 267:1–38.
- Nauta et al. (2023) Meike Nauta, Jan Trienes, Shreyasi Pathak, Elisa Nguyen, Michelle Peters, Yasmin Schmitt, Jörg Schlötterer, Maurice van Keulen, and Christin Seifert. 2023. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI. *ACM Computing Surveys*, 55(13s).
- Naveed et al. (2024) Sidra Naveed, Gunnar Stevens, and Dean Robin-Kern. 2024. [An overview of the empirical evaluation of explainable AI (XAI)](https://doi.org/10.3390/app142311288). *Applied Sciences*, 14(23):11288.
- Parasuraman and Riley (1997) Raja Parasuraman and Victor Riley. 1997. [Humans and automation: Use, misuse, disuse, abuse](https://doi.org/10.1518/001872097778543886). *Human Factors*, 39(2):230–253.
- Petty and Cacioppo (1986) Richard E. Petty and John T. Cacioppo. 1986. [The elaboration likelihood model of persuasion](https://doi.org/10.1016/S0065-2601(08)60214-2). *Advances in Experimental Social Psychology*, 19:123–205.
- Reingold et al. (2024) Omer Reingold, Judy Hanwen Shen, and Aditi Talati. 2024. Dissenting explanations: Leveraging disagreement to reduce model overreliance. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 38, pages 21537–21544.
- Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. [“Why Should I Trust You?”: Explaining the predictions of any classifier](https://doi.org/10.1145/2939672.2939778). In *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, pages 1135–1144. ACM.
- Schemmer et al. (2022) Max Schemmer, Patrick Hemmer, Niklas Kühl, Carina Benz, and Gerhard Satzger. 2022. [Should I follow AI-based advice? measuring appropriate reliance in human-AI decision-making](https://doi.org/10.1145/3544548.3581095). In *CHI Conference on Human Factors in Computing Systems*, pages 1–15. ACM.
- Shymanski et al. (2025a) Joe Shymanski, Jacob Brue, and Sandip Sen. 2025a. Beyond satisfaction: From placebic to actionable explanations for enhanced understandability. *arXiv preprint arXiv:2512.06591*.
- Shymanski et al. (2025b) Joe Shymanski, Jacob Brue, and Sandip Sen. 2025b. Not all explanations are created equal: Investigating the pitfalls of current XAI evaluation. *arXiv preprint arXiv:2511.03730*.
- Slack et al. (2020) Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. [Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods](https://doi.org/10.1145/3375627.3375830). In *Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society*, pages 180–186. ACM.
- Spillner et al. (2026) Laura Spillner, Rachel Ringe, Robert Porzel, and Rainer Malaka. 2026. Not all trust is the same: Effects of decision workflow and explanations in human-AI decision making. *arXiv preprint arXiv:2603.05229*.
- Suh et al. (2025) Ashley Suh, Kenneth Alperin, Harry Li, and Steven R. Gomez. 2025. Don’t just translate, agitate: Using large language models as devil’s advocates for AI explanations. In *Human-centered Explainable AI Workshop (HCXAI) @ CHI*.
- Swamy et al. (2025) Vinitra Swamy, Davide Romano, Bhargav Srinivasa Desikan, Oana-Maria Camburu, and Tanja Käser. 2025. illuminate: An LLM-XAI framework leveraging social science explanation theories. *arXiv preprint arXiv:2409.08027*.
- Theissler et al. (2022) Andreas Theissler, Francesco Spinnato, Udo Schlegel, and Riccardo Guidotti. 2022. [Explainable AI for time series classification: A review, taxonomy and research directions](https://doi.org/10.1109/ACCESS.2022.3207765). *IEEE Access*, 10:100700–100724.
- Vasconcelos et al. (2023) Helena Vasconcelos, Matthew Jörke, Madeleine Grunde-McLaughlin, Tobias Gerstenberg, Michael S. Bernstein, and Ranjay Krishna. 2023. [Explanations can reduce overreliance on AI systems during decision-making](https://doi.org/10.1145/3579605). *Proceedings of the ACM on Human-Computer Interaction*.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, and 1 others. 2023. [Judging LLM-as-a-judge with MT-Bench and chatbot arena](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html). In *Advances in Neural Information Processing Systems*, volume 36.
- Zytek et al. (2024a) Alexandra Zytek, Dongyu Liu, Andrei Vasiliev, and Kalyan Veeramachaneni. 2024a. [LLMs for XAI: Future directions for explaining explanations](https://arxiv.org/abs/2405.06064). *Preprint*, arXiv:2405.06064.
- Zytek et al. (2024b) Alexandra Zytek, Sara Pido, Sarah Alnegheimish, Laure Berti-Équille, and Kalyan Veeramachaneni. 2024b. [Explingo: Explaining AI predictions using large language models](https://arxiv.org/abs/2412.05145). In *2024 IEEE International Conference on Big Data (BigData)*. IEEE.

## Appendix A Experimental Details

#### Preprocessing\.

Missing values ($\sim$1.25%) are filled by linear interpolation, kilowatt-minutes are converted to kilowatt-hours, and minute-level data are aggregated to weekly totals anchored on Mondays (ISO week alignment). After dropping rows with NaN lag values, the dataset contains 200 weekly observations with 9 features (lag\_1–lag\_7, weekofyear, holiday\_week\_count) and the target (weekly kWh). A chronological 70/30 split yields 140 training and 60 test instances (approximately Dec 2006 – Aug 2009 train; Sept 2009 – Nov 2010 test).

Data Loading and Feature Engineering \(excerpt\)

```python
import holidays
import pandas as pd


def load_weekly(file_path, country="FR"):
    df = pd.read_csv(file_path, sep=";", na_values="?")
    df["DateTime"] = pd.to_datetime(df["Date"] + " " + df["Time"])
    # Linear interpolation of missing readings; kilowatt-minutes -> kWh
    df["Global_active_power"] = (
        df["Global_active_power"].astype(float).interpolate() * (1 / 60)
    )
    # resample() requires a DatetimeIndex, so index by DateTime first
    w = df.set_index("DateTime")["Global_active_power"].resample("W-MON").sum()
    w = w.to_frame("True_Value").reset_index()
    w["weekofyear"] = w["DateTime"].dt.isocalendar().week
    fr_holidays = holidays.CountryHoliday(country)
    # Count public holidays among the seven days ending on each weekly anchor
    w["holiday_week_count"] = w["DateTime"].apply(
        lambda d: sum(
            (d - pd.Timedelta(days=i)).date() in fr_holidays for i in range(7)
        )
    )
    return w


# `weekly` is the frame returned by load_weekly(...)
N_LAGS = 7
for i in range(1, N_LAGS + 1):
    weekly[f"lag_{i}"] = weekly["True_Value"].shift(i)
weekly = weekly.dropna()  # -> 200 rows

cut = int(len(weekly) * 0.70)
train, test = weekly.iloc[:cut], weekly.iloc[cut:]
# train: 140 instances, test: 60 instances
```

#### XGBoost configuration\.

Hyperparameters are selected via a 200\-iteration random search with 4\-fold time\-series cross\-validation \(TimeSeriesSplit\)\. The selection criterion is MAE on the validation folds, with early stopping \(200 rounds\)\.
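
To make the fold structure concrete, the following is a minimal pure-Python sketch of a 4-fold expanding-window split. It is intended to mirror the default behaviour of scikit-learn's `TimeSeriesSplit`; applying it to the 140 training instances is an assumption, and `expanding_window_splits` is a hypothetical helper rather than the authors' code.

```python
def expanding_window_splits(n_samples, n_splits=4):
    """Yield (train_indices, validation_indices) for expanding-window CV."""
    fold_size = n_samples // (n_splits + 1)
    first_val_start = n_samples - n_splits * fold_size
    for val_start in range(first_val_start, n_samples, fold_size):
        train_idx = list(range(0, val_start))
        val_idx = list(range(val_start, val_start + fold_size))
        yield train_idx, val_idx


# Assumption: cross-validation runs on the 140 training weeks only.
folds = list(expanding_window_splits(140, n_splits=4))
for k, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {k}: train 0-{train_idx[-1]}, validate {val_idx[0]}-{val_idx[-1]}")
```

Each validation fold lies strictly after its training window, so no future weeks inform the hyperparameter choice.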

Table 7: XGBoost hyperparameters\.

All predictions are one\-step\-ahead: each test instance uses true observed values from all preceding weeks as lag features, not recursive model outputs\. Test\-set performance: $R^{2}=0.69$, MAE $=20.55$ kWh, RMSE $=25.04$ kWh\.

#### SHAP computation\.

TreeExplainer \(Lundberg and Lee, [2017](https://arxiv.org/html/2605.26770#bib.bib37); Lundberg et al\., [2020](https://arxiv.org/html/2605.26770#bib.bib36)\) provides exact Shapley values for tree\-based models in polynomial time\. For each test instance, the output is a vector of nine SHAP values plus a base value $\phi_{0}=177.37$ kWh \(close to the training\-set mean of 177\.53 kWh\), satisfying the efficiency property $\hat{y}=\phi_{0}+\sum_{j=1}^{9}\phi_{j}$\.

SHAP Computation

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
base_value = explainer.expected_value
```
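
The efficiency property can be checked by hand on the worked instance of 2009-11-09, using the SHAP values, base value and prediction listed for it in Appendix C.5. The snippet below is only an arithmetic sketch; the variable names are illustrative.

```python
# SHAP attributions for the 2009-11-09 test instance (values from Appendix C.5).
shap_values = {
    "weekofyear": 24.032,
    "lag_6": -13.501,
    "lag_1": -11.754,
    "lag_2": 3.363,
    "lag_7": -3.246,
    "lag_4": -2.308,
    "lag_5": 1.771,
    "lag_3": -0.263,
    "holiday_week_count": -0.019,
}
base_value = 177.366  # expected/base value reported for this instance
prediction = 175.44   # XGBoost prediction reported for this instance

# Efficiency: prediction = base value + sum of per-feature contributions.
reconstructed = base_value + sum(shap_values.values())
gap = abs(reconstructed - prediction)
print(f"reconstructed = {reconstructed:.3f} kWh, gap = {gap:.3f} kWh")
```

The reconstructed value agrees with the reported prediction up to the rounding of the printed attributions.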

#### E5 OOD poisoning\.

For E5 \(§[8](https://arxiv.org/html/2605.26770#S8)\), each test instance is converted into an out\-of\-distribution variant by season\-dependent lag poisoning\. The function below produces the poisoned feature row; XGBoost and TreeExplainer are then re\-run on the poisoned inputs to obtain the corresponding prediction and SHAP attributions, and a fresh NLE is generated by the same pipeline as for the original instances \(§[3\.2](https://arxiv.org/html/2605.26770#S3.SS2)\)\.

E5 OOD Poisoning

```python
import numpy as np
import shap

# RANDOM_SEED, LAG_STATS, FEATURES, model and poisoned_df are defined
# elsewhere in the pipeline and are not part of this excerpt.
LAG_FEATURES = [f"lag_{i}" for i in range(1, 8)]
rng = np.random.default_rng(RANDOM_SEED)


def poison_instance(row, level="ood"):
    """Create an OOD variant of a test instance.

    Cold-season weeks (44-17): lags pushed
    abnormally LOW (winter-low).
    Warm-season weeks (18-43): lags pushed
    abnormally HIGH (summer-high).
    """
    poisoned = row.copy()
    if level == "baseline":
        return poisoned

    if row["season"] == "cold":
        target = 0.8 * LAG_STATS["min"]    # ~50.8 kWh
    else:
        target = 1.2 * LAG_STATS["max"]    # ~327.9 kWh

    for i, lag in enumerate(LAG_FEATURES):
        decay = 1.0 - i * 0.05             # lag_1=1.0, lag_7=0.7
        noise = rng.uniform(-5, 5)
        poisoned[lag] = target * decay + noise
    return poisoned


# Re-run XGBoost and SHAP on the poisoned inputs
X_ood = poisoned_df[FEATURES].values
poisoned_df["prediction_ood"] = model.predict(X_ood)
shap_ood = shap.TreeExplainer(model).shap_values(X_ood)
```
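
The excerpt reads `row["season"]` without showing how it is derived. The sketch below follows the week ranges given in the docstring and also lists the noise-free lag schedule implied by the decay term. Both helpers (`season_of_week`, `noise_free_lags`) are hypothetical, and the targets 50.8 and 327.9 kWh are the approximate values from the code comments, not exact statistics.

```python
def season_of_week(weekofyear):
    """Hypothetical helper: 'warm' for ISO weeks 18-43, otherwise 'cold' (weeks 44-17)."""
    return "warm" if 18 <= weekofyear <= 43 else "cold"


def noise_free_lags(target, n_lags=7):
    """Poisoned lag values without the uniform(-5, 5) noise term (lag_1 first)."""
    return [target * (1.0 - i * 0.05) for i in range(n_lags)]


cold_lags = noise_free_lags(50.8)   # approximate cold-season target (kWh)
warm_lags = noise_free_lags(327.9)  # approximate warm-season target (kWh)

print(season_of_week(46), [round(v, 2) for v in cold_lags])
print(season_of_week(30), [round(v, 2) for v in warm_lags])
```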

## Appendix B NLE Generation Prompt

The system prompt below is used identically for both generators \(zero\-shot, $\tau=1.0$\)\. The human message provides instance data using the template that follows\.

System Prompt \(NLE generation\)

Interpret the time\-series forecasting context that follows\.

Your goal is to help a non\-technical lay user\. The user does not have a background in statistical models, machine learning, time\-series or explainability methods \(e\.g\. SHAP, LIME\)\. The user needs to understand

- why the model produced this forecast, and
- how much confidence can be placed in it\.

OUTPUT RULES

- Write up to six bullet points \-\- no more\.
- Keep the whole response $\leq$ 200 words\.
- Use plain language, but it is fine to employ key time\-series terms such as lag, trend, seasonality, baseline, error\.
- Do not reveal code\.

Human Message Template \(NLE generation\)

The following is about time series data with a single\-step ahead prediction, where the model predicts the next value in the time series based on previous observations\.

Data Domain: Energy Consumption

Dataset Description:

- 2,075,259 measurements from a house in Sceaux, France \(near Paris\), Dec 2006\-\-Nov 2010\.
- Recorded at one\-minute resolution, then resampled weekly\.
- Lag\_1\.\.Lag\_7, ISO week number and number of public holidays per week were added as features\.
- Target: weekly global active power \(kWh\)\.

Model Used: XGBRegressor

Model Performance: MAE 20\.550, RMSE 25\.035, R2 0\.686

Prediction: \{prediction\}

Instance Features or Context: lag\_1: \{value\}, lag\_2: \{value\}, \.\.\., lag\_7: \{value\}, weekofyear: \{value\}, holiday\_week\_count: \{value\}

SHAP values \(sorted by \|SHAP\| descending\): \{feature\}: \{shap\_value\}, \.\.\.

The expected/base value for SHAP: \{base\_value\}
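
As an illustration of the "sorted by |SHAP| descending" slot in the template, the sketch below orders attributions by absolute value and renders them as `feature: value` pairs. The function name `format_shap_line` and the signed three-decimal formatting are assumptions made for this example; the paper does not show the rendering code. The four values are a subset of those listed in Appendix C.5.

```python
def format_shap_line(shap_values):
    """Render SHAP attributions ordered by absolute magnitude, largest first."""
    ordered = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ", ".join(f"{name}: {value:+.3f}" for name, value in ordered)


# Subset of the attributions for the 2009-11-09 instance, deliberately unsorted.
example = {
    "lag_1": -11.754,
    "holiday_week_count": -0.019,
    "weekofyear": 24.032,
    "lag_6": -13.501,
}
shap_line = format_shap_line(example)
print(shap_line)
```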

#### Example output\.

For the test instance of 2009\-11\-09 \(ISO week 46\), XGBoost predicts 175\.44 kWh \(true value 208\.20 kWh; error $-15.7\%$\)\. GPT\-4o \(zero\-shot, $\tau=1.0$\) produces:

- Why the prediction is 175\.44: The model considers recent weekly consumption from Lag\_1 to Lag\_7 and the time of year\. The week of year \(\+24\.0\) raised the prediction; Lag\_6 \(−13\.5\) and Lag\_1 \(−11\.8\) lowered it\.
- Influences: Seasonal factors elevated the forecast slightly, while decreases in recent usage \(Lag\_6, Lag\_1\) weighted the prediction lower\.
- Baseline: The model starts from a baseline of 177\.37 kWh and adjusts based on per\-feature contributions\.
- Reliability: $R^{2}=0.69$ suggests the model captures about 69% of the variance, indicating reasonably solid reliability\.
- Error margin: Past MAE = 20\.55 and RMSE = 25\.03 suggest weekly consumption may differ by this range; moderate uncertainty\.
- Confidence: You can reasonably trust the prediction; consider a $\pm 25$ kWh margin\.

## Appendix C Judge Prompts

All judge prompts use temperature $\tau=0.0$ for deterministic evaluation and output structured JSON\. The system prompts below are task\-specific; the human message that accompanies them contains the instance data and, depending on the condition, the SHAP attributions, the performance metrics, and the NLE\. A complete worked example – system prompt plus human message – for the E\+X\+T condition of E1 is shown at the end of this section\.

### C\.1 E1 and E2: Closeness Task

The closeness task is shared between E1 \(forward simulatability across five information conditions\) and E2 \(placebic control\)\. The judge system prompt is identical; the human message differs only in which condition’s information is included\.

System Prompt \(E1 / E2\)

You are evaluating a machine learning model’s prediction accuracy for a time series forecasting task\.

You have limited background in machine learning and statistics \-\- you understand basic concepts but are not an expert\.

Your task: Based on all the information provided, assess how large the prediction error is likely to be\. Carefully evaluate the features, model context, and any additional information given to form your judgment\.

Important: Think carefully about each specific instance\. Consider all the details provided \-\- feature values, patterns, and any explanations\. Do not default to a ‘‘safe’’ middle\-ground answer\. The prediction has a specific error magnitude; use the evidence to determine what it is\.

Evaluation steps \(think through each before giving your final answer\):

1. Examine the features \-\- consider the lag values, the time of year \(season in France\), holidays, and whether the prediction seems reasonable given this context\.
2. Compare the prediction to recent lag values \-\- does it follow the recent trend?
3. If model performance metrics are provided, use them to estimate a typical error range\.
4. If SHAP values are provided, consider whether each feature’s contribution makes sense given the feature values and context\.
5. If a natural language explanation is provided, you can use it to better understand the model’s reasoning\.
6. Weigh all available evidence to determine the most likely error magnitude\.

Error bucket definitions \(based on absolute percentage error\):

- small: $[0\%,\,5\%)$
- medium: $[5\%,\,15\%)$
- large: $[15\%,\,30\%)$
- very\_large: $[30\%,\,+\infty)$

You must respond with EXACTLY this JSON format and nothing else:

\{ "error\_bucket": "<small\|medium\|large\|very\_large\>", "confidence": <integer 1\-\-5\>, "reasoning": "<one sentence\>" \}

### C\.2 E3: Counterfactual Simulatability

System Prompt \(E3\)

You are evaluating how a machine learning model’s prediction would change if certain input features were modified\.

You have limited background in machine learning and statistics \-\- you understand basic concepts but are not an expert\.

Your task: Given the original instance and a proposed feature change, predict whether the model’s prediction will be HIGHER, LOWER, or SIMILAR compared to the original prediction\.

Evaluation steps:

1. Identify which features are being changed and note their current SHAP values\.
2. For each changed feature, determine the direction of change\.
3. Consider how the SHAP contribution might change\.
4. If multiple features are changed, think about their combined effect\.
5. If a natural language explanation is provided, you can use it to better understand the feature\-\-prediction relationships\.
6. Estimate the net directional effect and whether it exceeds the 5% threshold\.

Direction definitions:

- higher: prediction increases by more than 5% relative to the original\.
- lower: prediction decreases by more than 5% relative to the original\.
- similar: prediction changes by less than 5% in either direction\.

JSON response: \{ "direction": "<higher\|lower\|similar\>", "confidence": 1\-\-5, "reasoning": "\.\.\." \}
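
The direction definitions translate into a simple labelling rule for the ground truth. The sketch below is one possible reading: `direction_label` is a hypothetical helper, the perturbed predictions are made-up numbers, and treating a change of exactly 5% as "similar" is an assumption, since the prompt leaves that boundary undefined.

```python
def direction_label(original, perturbed, threshold=0.05):
    """Classify the relative change of a (positive) prediction after a feature edit."""
    rel_change = (perturbed - original) / original
    if rel_change > threshold:
        return "higher"
    if rel_change < -threshold:
        return "lower"
    return "similar"  # assumption: the exact +/-5% boundary counts as similar


original = 175.44  # prediction for the worked instance; perturbed values are hypothetical
labels = [direction_label(original, p) for p in (190.0, 170.0, 160.0)]
print(labels)
```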

### C\.3 E4: Mental Model Transfer

System Prompt \(E4\)

You are evaluating a machine learning model’s prediction accuracy for a time series forecasting task\.

You have limited background in machine learning and statistics \-\- you understand basic concepts but are not an expert\.

You will first see several example predictions with their KNOWN error outcomes\. Study these examples carefully to understand the model’s behavior patterns\. Then you will evaluate a NEW prediction\.

Your task: Based on the patterns you observed in the examples and the information about the new instance, classify the new prediction’s likely error bucket\.

Error bucket definitions: same as E1/E2\.

JSON response: \{ "error\_bucket": "\.\.\.", "confidence": 1\-\-5, "reasoning": "\.\.\." \}

### C\.4 E5: Selective Reliance \(OOD Detection\)

System Prompt \(E5\)

You are evaluating a machine learning model’s prediction for a time series forecasting task\.

You have limited background in machine learning and statistics \-\- you understand basic concepts but are not an expert\.

Your task: Based on all the information provided, decide whether this prediction is reliable or unreliable\. Consider whether the input features look normal or unusual, whether they match expected patterns for the time of year, and whether the model can be trusted given these inputs\.

Evaluation steps:

1. Examine the features \-\- consider the lag values, the time of year \(season in France\), holidays, and whether the prediction seems reasonable given this context\.
2. If SHAP values are provided, consider whether each feature’s contribution makes sense given the feature values and context\.
3. Consider the model performance metrics to gauge general trustworthiness\.
4. If a natural language explanation is provided, you can use it to better understand the model’s reasoning\.
5. Weigh all evidence to decide whether this prediction can be trusted\.

Reliability:

- reliable = inputs look normal and within what the model was trained on\.
- unreliable = inputs look unusual or far from what the model was trained on\.

JSON response: \{ "reliability": "<reliable\|unreliable\>", "confidence": 1\-\-5, "reasoning": "\.\.\." \}

### C\.5 Worked Judge Human Message \(E1, E\+X\+T\)

To make the data format the judge actually receives concrete, we reproduce below the complete human message for one judgment: the test instance of 2009\-11\-09 \(ISO week 46\) under E1’s full E\+X\+T condition \(features \+ SHAP \+ metrics \+ NLE\)\. The same instance is used in the NLE example of Appendix [B](https://arxiv.org/html/2605.26770#A2); the NLE text below is the GPT\-4o output shown there\. Other conditions strip the corresponding blocks \(e\.g\., the X condition removes the Metrics block and the NLE block\); other experiments substitute their own task framing while retaining the same data\-block layout\.

Human Message Example \(E1, E\+X\+T condition\)

The following is about time series data with a single\-step ahead prediction, where the model predicts the next value in the time series based on previous observations\.

Data Domain: Energy Consumption

Dataset Description:

- 2,075,259 measurements from a house in Sceaux, France \(near Paris\), Dec 2006\-\-Nov 2010\.
- Recorded at one\-minute resolution, then resampled weekly\.
- Lag\_1\.\.Lag\_7, ISO week number and number of public holidays per week were added as features\.
- Target: weekly global active power \(kWh\)\.

Model Used: XGBRegressor

Model Performance:

- MAE: 20\.550
- RMSE: 25\.035
- R2: 0\.686

Prediction: 175\.44

Instance Features or Context: lag\_1: 184\.727, lag\_2: 216\.379, lag\_3: 197\.242, lag\_4: 191\.376, lag\_5: 166\.269, lag\_6: 162\.277, lag\_7: 189\.772, weekofyear: 46, holiday\_week\_count: 0

SHAP values \(sorted by \|magnitude\|, descending\):

- weekofyear: \+24\.032
- lag\_6: −13\.501
- lag\_1: −11\.754
- lag\_2: \+3\.363
- lag\_7: −3\.246
- lag\_4: −2\.308
- lag\_5: \+1\.771
- lag\_3: −0\.263
- holiday\_week\_count: −0\.019

The expected/base value for SHAP: 177\.366

Natural language explanation of this prediction:

- Why the prediction is 175\.44: The model considers recent weekly consumption from Lag\_1 to Lag\_7 and the time of year\. The week of year \(\+24\.0\) raised the prediction; Lag\_6 \(−13\.5\) and Lag\_1 \(−11\.8\) lowered it\.
- Influences: Seasonal factors elevated the forecast slightly, while decreases in recent usage \(Lag\_6, Lag\_1\) weighted the prediction lower\.
- Baseline: The model starts from a baseline of 177\.37 kWh and adjusts based on per\-feature contributions\.
- Reliability: $R^{2}=0.69$ suggests the model captures about 69% of the variance, indicating reasonably solid reliability\.
- Error margin: Past MAE $=20.55$ and RMSE $=25.03$ suggest weekly consumption may differ by this range; moderate uncertainty\.
- Confidence: You can reasonably trust the prediction; consider a $\pm 25$ kWh margin\.

Based on all the information above, classify this prediction’s likely error bucket\.

The true weekly consumption for this instance was 208\.20 kWh, giving an absolute percentage error of 15\.7% \(the *large* bucket\)\.
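
The ground-truth label for this instance can be reproduced from the bucket definitions in C.1. The sketch below is illustrative only; `error_bucket` is a hypothetical helper name, while the thresholds and the two input values are those stated in this appendix.

```python
def error_bucket(prediction, true_value):
    """Return (absolute percentage error, bucket) using the C.1 bucket edges."""
    ape = abs(prediction - true_value) / abs(true_value) * 100.0
    if ape < 5:
        bucket = "small"
    elif ape < 15:
        bucket = "medium"
    elif ape < 30:
        bucket = "large"
    else:
        bucket = "very_large"
    return ape, bucket


ape, bucket = error_bucket(prediction=175.44, true_value=208.20)
print(f"APE = {ape:.1f}% -> {bucket}")
```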

## Appendix D Full Statistical Results

#### Modelling framework\.

All experiments share a within\-instance repeated\-measures design\. Accuracy: GLMM \(binomial, logit link\) via lme4 \(Bates et al\., [2015](https://arxiv.org/html/2605.26770#bib.bib6)\), fixed effects condition $+$ judge \($+$ interaction where convergence allows\), random intercept by instance\. Optimiser: bobyqa\. Confidence: CLMM \(cumulative logit\) via ordinal \(Christensen, [2019](https://arxiv.org/html/2605.26770#bib.bib14)\)\. Overconfidence: CLMM with correctness $\times$ condition interaction\. Omnibus tests: likelihood\-ratio tests \(LRT\) on nested models\. Planned contrasts: estimated marginal means with Holm correction \(Lenth, [2024](https://arxiv.org/html/2605.26770#bib.bib33)\)\. Effect size: odds ratios \(OR\) with 95% CIs\. Bayesian analyses: brms \(Bürkner, [2017](https://arxiv.org/html/2605.26770#bib.bib10)\), $\mathcal{N}(0,\,1.5)$ priors on fixed effects \(Gelman et al\., [2008](https://arxiv.org/html/2605.26770#bib.bib18)\), half\-$t_{3}(0,\,2.5)$ on random\-intercept SD, 4 chains $\times$ 4000 iterations\. ROPE on log\-odds scale: $[-0.18,\,+0.18]$ \($\text{OR}\in[0.84,\,1.20]$\)\.
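
The correspondence between the log-odds ROPE and its odds-ratio form, and the "percentage inside ROPE" statistic reported in the Bayesian tables, can be sketched as follows. The posterior draws are hypothetical toy values and `share_in_rope` is an illustrative helper; the actual analyses use brms posteriors.

```python
import math

ROPE_LOG_ODDS = (-0.18, 0.18)

# Exponentiating the log-odds bounds gives the odds-ratio bounds.
rope_or = tuple(round(math.exp(bound), 2) for bound in ROPE_LOG_ODDS)


def share_in_rope(draws, rope=ROPE_LOG_ODDS):
    """Fraction of posterior draws (log-odds scale) that fall inside the ROPE."""
    low, high = rope
    return sum(low <= d <= high for d in draws) / len(draws)


toy_draws = [-0.40, -0.10, 0.05, 0.17, 0.30]  # hypothetical log-odds draws
print(rope_or, share_in_rope(toy_draws))
```

Practical equivalence is declared only when at least 95% of the posterior mass lies inside the ROPE, which is why shares such as those reported below are described as inconclusive rather than as evidence of equivalence.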

This appendix reports the full statistical output for the Part 2 analyses described in §[3\.3](https://arxiv.org/html/2605.26770#S3.SS3)\. Each experiment is presented with a common structure: \(1\) descriptives, \(2\) GLMM fixed effects and model fit, \(3\) CLMM summary for confidence, \(4\) calibration analysis \(mixed\-effects model, Somers’ $D$, empirical calibration, conditional confidence\), \(5\) model diagnostics, \(6\) Bayesian ROPE, and \(7\) sensitivity analyses \(judge\-specific, generator, same\-family, random slopes, and experiment\-specific covariates\)\. Where the model structure permits, estimated marginal means and a planned\-contrasts / pairwise\-comparisons table are also reported – specifically, for experiments with three or more conditions \(E1, E2\)\. Experiments with only two conditions \(E3, E4\) or a $2\times 2$ interaction design \(E5\) do not require a separate pairwise table because the fixed\-effect coefficient in the primary GLMM is itself the pairwise contrast \(or the interaction term is the relevant simple\-effects decomposition\)\. All primary models use the interaction specification of the primary GLMM\.

#### Reading the diagnostics line\.

Each experiment reports a one\-line diagnostics summary of the form “Convergence: OK\. Singularity: OK\. DHARMa uniformity $p=\cdot$\. Dispersion $p=\cdot$\.” The four checks mean: \(i\) the maximum\-likelihood optimisation converged without warnings; \(ii\) the random\-intercept variance is not estimated on the boundary \(non\-zero, not a singular fit\); \(iii\) the DHARMa simulated\-residual distribution is uniform \(a large $p$\-value is *good* – it indicates no evidence of miscalibration\); and \(iv\) the dispersion test detects no over\- or under\-dispersion of the binomial residuals \(again, a large $p$ means no problem\)\.

#### Reading “NA” rows in empirical calibration tables\.

Some tables include an “NA” row under *Confidence level*\. These are judgments where the confidence rating could not be parsed from the judge’s response \(e\.g\., the judge wrote a response that did not match the expected 1–5 integer format\)\. All such judgments are retained in the primary model \(via confidence\_num, treated as continuous when the response parses and as missing when it does not\) but tabulated separately so their accuracy rate – which is essentially zero because these responses also typically fail to commit to a valid outcome bucket – is visible rather than hidden\.
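
A minimal sketch of how such "NA" rows can arise when parsing judge output is given below. It assumes the E1/E2 JSON schema from Appendix C.1; `parse_judgment` is a hypothetical function and the paper does not show its actual parsing logic.

```python
import json

VALID_BUCKETS = {"small", "medium", "large", "very_large"}


def parse_judgment(raw):
    """Parse a judge response; fields that fail validation become None (i.e. NA)."""
    empty = {"error_bucket": None, "confidence": None}
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return empty
    if not isinstance(payload, dict):
        return empty
    bucket = payload.get("error_bucket")
    if not (isinstance(bucket, str) and bucket in VALID_BUCKETS):
        bucket = None
    confidence = payload.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_valid_int = isinstance(confidence, int) and not isinstance(confidence, bool)
    if not (is_valid_int and 1 <= confidence <= 5):
        confidence = None
    return {"error_bucket": bucket, "confidence": confidence}


ok = parse_judgment('{"error_bucket": "large", "confidence": 4, "reasoning": "x"}')
na = parse_judgment("I am fairly confident the error is moderate.")
print(ok, na)
```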

### D\.1 Experiment 1: Forward Simulatability

#### Descriptives

Table 8: E1 descriptives\. True bucket distribution across the 60 test instances: small 12, medium 29, large 16, very\-large 3 \(192, 348, 144, 36 judgments respectively\)\.
#### GLMM Fixed Effects \(Accuracy\)

Table 9: E1 GLMM fixed effects\. Reference: Baseline, DeepSeek\-R1\. Omnibus LRT: $\chi^{2}(8)=8.23$, $p=.41$\. $N=720$, 60 instance groups\. No judge $\times$ condition interaction is significant\.
#### Estimated Marginal Means \(Accuracy\)

Table 10: E1 EMMs on the probability scale, marginal and by judge\. Full 5\-by\-3 grid \(no selected subset\)\.
#### Planned Contrasts and All Pairwise

| Contrast | OR | SE | $z$ | $p$ |
| --- | --- | --- | --- | --- |
| Planned: | | | | |
| E\+X\+T vs X\+T \(NLE marginal effect\) | 0\.83 | 0\.25 | −0\.63 | \.530 |
| X\+T vs Baseline \(information effect\) | 1\.55 | 0\.54 | \+1\.26 | \.208 |
| All pairwise \(Holm\-corrected, all $p_{\text{Holm}}=1.0$\): | | | | |
| Baseline vs X | 0\.825 | 0\.283 | −0\.56 | 1\.0 |
| Baseline vs T | 0\.785 | 0\.273 | −0\.70 | 1\.0 |
| Baseline vs X\+T | 0\.646 | 0\.224 | −1\.26 | 1\.0 |
| Baseline vs E\+X\+T | 0\.780 | 0\.234 | −0\.83 | 1\.0 |
| X vs T | 0\.951 | 0\.328 | −0\.15 | 1\.0 |
| X vs X\+T | 0\.783 | 0\.269 | −0\.71 | 1\.0 |
| X vs E\+X\+T | 0\.946 | 0\.279 | −0\.19 | 1\.0 |
| T vs X\+T | 0\.823 | 0\.287 | −0\.56 | 1\.0 |
| T vs E\+X\+T | 0\.994 | 0\.300 | −0\.02 | 1\.0 |
| X\+T vs E\+X\+T | 1\.207 | 0\.362 | \+0\.63 | 1\.0 |

Table 11: E1 condition contrasts on accuracy\. No pair survives Holm correction\.
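
The $z$ column of the contrast table can be recovered from the OR and SE columns. The sketch below assumes that the reported SE is on the odds-ratio scale and was obtained by the delta method (SE of the log-odds multiplied by the OR), which is an assumption about how the table was produced; `z_from_or` is a hypothetical helper.

```python
import math


def z_from_or(odds_ratio, se_or):
    """Recover the Wald z statistic from an OR and its delta-method SE."""
    se_log_odds = se_or / odds_ratio  # undo the delta-method scaling
    return math.log(odds_ratio) / se_log_odds


z_nle = z_from_or(1.207, 0.362)   # X+T vs E+X+T
z_info = z_from_or(0.646, 0.224)  # Baseline vs X+T
print(round(z_nle, 2), round(z_info, 2))

# Reversing a contrast inverts the odds ratio: X+T vs E+X+T -> E+X+T vs X+T.
print(round(1 / 1.207, 2))
```

Under this assumption the recomputed values match the tabulated $z$ statistics to two decimals, and the inverted odds ratio matches the planned NLE contrast.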
#### CLMM \(Confidence\)

Table 12: E1 CLMM on confidence \($N=625$ non\-missing\)\. Reference: Baseline, DeepSeek\-R1\. Large positive shifts for T, X\+T, E\+X\+T indicate information\-scaled confidence inflation\.

Table 13: E1 confidence pairwise comparisons on the log\-odds scale \(Holm\-corrected\)\. Eight of ten pairs differ at $p<.0001$\. The information\-bearing conditions \(X\+T, E\+X\+T\) sit far above Baseline/X/T; the NLE does not add a further significant increment over X\+T \(last row, $p=.62$\)\.
#### Calibration

Table 14: E1 calibration model: $\text{correct}\sim\text{confidence}\times\text{condition}+\text{judge}+(1\mid\text{instance})$\. The confidence $\times$ E\+X\+T interaction \($p=.008$\) is the only significant calibration slope difference: in the full\-NLE condition, higher confidence tracks accuracy better than at Baseline\. The main confidence coefficient is the reference \(Baseline\) slope\.

Table 15: E1 Somers’ $D$ \(confidence–accuracy ordinal association\)\. Negative $D$ indicates that higher stated confidence is associated with *lower* accuracy\. Baseline shows the worst calibration; the full\-NLE condition is the only one with \(very weakly\) positive calibration\.

Table 16: E1 empirical calibration: observed accuracy at each stated confidence level, per condition\. A perfectly calibrated rater shows monotonically increasing accuracy with confidence; in Baseline the pattern is inverted \(confidence 4 has lower accuracy than confidence 3\)\.

Table 17: E1 mean confidence split by correctness\. Confidence when wrong rises from 3\.46 \(Baseline\) to 3\.90 \(E\+X\+T\) – wrong answers become more confident with more information, the definition of overconfidence\.
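
For readers unfamiliar with Somers' $D$, the sketch below computes one common variant for a binary outcome: the difference between concordant and discordant (correct, incorrect) pairs, divided by the number of such pairs. Which variant and tie handling the paper uses is not stated, so this is an assumption, and the toy ratings are hypothetical.

```python
def somers_d(confidence, correct):
    """Somers' D of confidence with respect to a binary correctness outcome."""
    conf_correct = [c for c, ok in zip(confidence, correct) if ok]
    conf_wrong = [c for c, ok in zip(confidence, correct) if not ok]
    if not conf_correct or not conf_wrong:
        raise ValueError("need at least one correct and one incorrect judgment")
    concordant = sum(c > w for c in conf_correct for w in conf_wrong)
    discordant = sum(c < w for c in conf_correct for w in conf_wrong)
    # Tied pairs count in the denominator but in neither numerator term.
    return (concordant - discordant) / (len(conf_correct) * len(conf_wrong))


# Hypothetical ratings: confidence on a 1-5 scale, correctness as 0/1.
well_calibrated = somers_d([3, 4, 5, 4], [0, 0, 1, 1])
inverted = somers_d([5, 4, 3, 3], [0, 0, 1, 1])
print(well_calibrated, inverted)
```

A value near zero means stated confidence carries no information about correctness; negative values correspond to the inverted pattern described for Baseline.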
#### Diagnostics

Convergence: OK\. Singularity: OK\. DHARMa uniformity $p=.179$\. Dispersion $p=.640$\. No violations\.

#### Bayesian ROPE

Table 18: E1 Bayesian posterior ORs for all fixed effects\. ROPE $=[-0.18,+0.18]$ on log\-odds $\Leftrightarrow\text{OR}\in[0.84,1.20]$\. No condition reaches the 95% ROPE threshold, but none excludes 1 either – the posteriors are consistent with small, non\-significant effects\.
#### Sensitivity

#### Judge\-specific\.

Stats\-plan item: refit the primary GLMM separately on the GPT\-4o and DeepSeek\-R1 subsets to verify that the marginal effect is stable across raters rather than driven by one judge\. Per\-judge model: $\text{correct}\sim\text{condition}+(1\mid\text{instance})$\.

Table 19: E1 judge\-specific ORs \(from separate GLMMs per judge\)\. Neither judge’s condition effects reach significance; they disagree in direction only for T and X\+T, consistent with the non\-significant judge $\times$ condition interactions in the primary GLMM \(Table [9](https://arxiv.org/html/2605.26770#A4.T9)\)\.
#### Generator and same\-family\.

Fitted on NLE rows only \($N=240$\): intercept $-0.350$ \(SE $0.478$, $p=.46$\); generator$_{\text{GPT-4o}}$ $+0.282$ \(SE $0.377$, $p=.455$\); judge effect $-1.514$ \(SE $0.413$, $p=.0002$\)\. Same\-family bias: $\text{OR}=1.00$, $p=1.0$\. Which LLM wrote the NLE does not affect accuracy; the dominant structured effect is judge identity\.

#### Random slopes\.

The model $\text{correct}\sim\text{condition}\times\text{judge}+(\text{condition}\mid\text{instance})$ is singular, indicating that the intercept\-only random structure is sufficient\.

### D\.2 Experiment 2: Placebic Control

#### Descriptives

Table 20: E2 descriptives\. By judge: DeepSeek\-R1 – Baseline \.533/3\.62, Real \.525/3\.90, Placebo \.517/3\.71; GPT\-4o – Baseline \.367/4\.00, Real \.433/4\.00, Placebo \.400/4\.09\.
#### GLMM Fixed Effects \(Accuracy\)

Table 21: E2 GLMM fixed effects\. Reference: Baseline, DeepSeek\-R1\. Omnibus LRT: $\chi^{2}(4)=1.55$, $p=.82$\. $N=600$, 60 instance groups\. No interaction significant\.
#### Estimated Marginal Means \(Accuracy\)

Table 22: E2 EMMs on the probability scale, marginal and by judge\.
#### Planned Contrasts and All Pairwise

Table 23: E2 planned and pairwise condition contrasts \(Holm\-corrected\)\. The critical Real\-vs\-Placebo contrast \($p=1.0$\) shows that real and placebo NLEs produce indistinguishable accuracy; see also the Bayesian equivalence analysis below\.
#### CLMM \(Confidence\)

The interaction CLMM produces NaN standard errors because the random\-intercept variance for confidence collapses to zero \(confidence does not vary meaningfully between instances\)\. Fixed\-effect point estimates are still informative\. The additive model is used as the primary fit\.

Table 24: E2 CLMM \(confidence\)\. SEs suppressed because the random\-intercept variance estimate is essentially zero; the additive model below is used for significance testing\.

Table 25: E2 confidence pairwise contrasts from the additive CLMM\. Real NLEs significantly elevate confidence relative to Baseline; placebo NLEs do so marginally; real and placebo are not significantly distinguishable, consistent with the presence\-not\-content hypothesis\.
#### Calibration

Table 26: E2 calibration model: $\text{correct}\sim\text{confidence}\times\text{condition}+\text{judge}+(1\mid\text{instance})$\. Neither the confidence $\times$ Real nor the confidence $\times$ Placebo interaction is significant: the confidence–accuracy mapping does not differ between real and placebo NLEs\.

Table 27: E2 Somers’ $D$\. All near zero – poor calibration throughout, with no difference between real and placebo\.

Table 28: E2 empirical calibration: accuracy at each stated confidence level, per condition\. Neither real nor placebo NLEs produce a monotonically increasing calibration curve\.

Table 29: E2 mean confidence split by correctness\. Confidence is essentially identical for correct and incorrect judgments within every condition – confidence is decoupled from accuracy throughout\.
#### Diagnostics

Convergence: OK\. Singularity: OK\. DHARMa uniformity $p=.605$\. Dispersion $p=.496$\. No violations\.

#### Bayesian ROPE and Real\-vs\-Placebo Equivalence

Table 30: E2 Bayesian ROPE analysis\. Both NLE conditions have posterior ORs close to 1 with wide CrIs that include 1\. The direct Real\-vs\-Placebo equivalence test yields 43\.1% inside ROPE: insufficient for formal practical equivalence \($\geq 95\%$\), but consistent with no meaningful difference between real and placebo NLEs\.
#### Sensitivity

#### Judge\-specific\.

Stats\-plan item: refit the primary GLMM separately per judge to verify that the marginal condition effects are stable across raters\. Per\-judge model: $\text{correct}\sim\text{condition}+(1\mid\text{instance})$\.

Table 31: E2 judge\-specific ORs\. DeepSeek\-R1’s per\-judge fit is near\-singular \(extreme intercept\) but directionally consistent with GPT\-4o: neither judge shows a significant NLE effect on accuracy\.
#### Generator and same\-family\.

Fitted on NLE rows \($N=480$, 2 conditions\):

Table 32: E2 generator and same\-family tests\. No generator effect\. Same\-family bias is marginal \($p=.082$\) but does not survive adjustment for multiple tests\.
#### Random slopes\.

Singular fit; intercept\-only model retained\.

#### Derangement caveat\.

Placebo NLEs were constructed using a single random derangement of the 60 instances \(each instance receives an NLE generated for a different instance, with the constraint that no instance receives its own\)\. A different derangement would produce slightly different placebo pairings; results are expected to be stable given the large $N$ but the dependence on one specific assignment is acknowledged\.
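
A random derangement of this kind can be drawn by rejection sampling: shuffle the instance indices and retry until no index stays in place. The sketch below is one simple way to do this; the function name `random_derangement` and the seed are hypothetical, and the paper does not specify how its derangement was generated.

```python
import random


def random_derangement(n, seed=0):
    """Return a permutation of range(n) with no fixed points (rejection sampling)."""
    rng = random.Random(seed)
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(perm[i] != i for i in range(n)):
            return perm


# assignment[i] is the index of the instance whose NLE is shown for instance i.
assignment = random_derangement(60)
print(assignment[:10])
```

Roughly one shuffle in $e \approx 2.72$ is a derangement, so the loop terminates after a few attempts on average.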

### D\.3 Experiment 3: Counterfactual Simulatability

#### Descriptives

Table 33: E3 descriptives\. True\-direction distribution across the 60 test instances: higher 28 \(168 judgments\), lower 24 \(144\), similar 8 \(48\)\. By judge: E\+X/DeepSeek \.617/4\.05, E\+X/GPT\-4o \.542/3\.95; X/DeepSeek \.600/3\.95, X/GPT\-4o \.467/3\.98\.
#### GLMM Fixed Effects \(Accuracy\)

Table 34: E3 GLMM fixed effects\. Reference: E\+X, DeepSeek\-R1\. $N=360$, 60 instance groups\. No interaction significant\.
#### Estimated Marginal Means \(Accuracy\)

Table 35: E3 EMMs on the probability scale\. NLE point estimates are higher for both judges but wide CIs span the SHAP\-only values\.
#### Confusion Matrices

Table 36: E3 confusion matrices for both conditions (rows: predicted; columns: true). Per-class recall (diagonal / column sum) is given in Table 37.

Table 37: E3 per-class recall. The “similar” class (small perturbation, predicted output barely changes) is substantially harder for both conditions; neither NLE nor SHAP-only reaches even 25% recall.
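
With rows indexing the predicted class and columns the true class, per-class recall is the diagonal entry divided by its column sum. A minimal sketch follows; the counts in `toy` are hypothetical and are not the values of the E3 confusion matrices.

```python
def per_class_recall(matrix, labels):
    """Per-class recall for matrix[i][j] = count(predicted labels[i], true labels[j])."""
    recall = {}
    for j, label in enumerate(labels):
        column_sum = sum(row[j] for row in matrix)
        recall[label] = matrix[j][j] / column_sum if column_sum else float("nan")
    return recall


labels = ["higher", "lower", "similar"]

# Hypothetical counts for illustration only (rows: predicted, columns: true).
toy = [
    [20, 6, 5],
    [6, 16, 4],
    [2, 2, 3],
]

recall = per_class_recall(toy, labels)
for label, value in recall.items():
    print(f"{label:8s} recall = {value:.3f}")
```

Dividing by the row sum instead would give precision, so the orientation of the matrix matters when reading the tables.
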
#### CLMM \(Confidence\)

Table 38: E3 CLMM on confidence. Confidence is near ceiling in both conditions; no coefficient reaches significance.
#### Calibration

Table 39: E3 calibration model. The main confidence_num coefficient (the E+X calibration slope) is positive and significant: higher confidence genuinely predicts accuracy under NLE. The conf $\times$ X interaction is marginal ($p=.095$), suggesting the SHAP-only condition has a weaker slope (the $D$-values corroborate this).

Table 40: E3 Somers’ $D$. Both conditions show positive calibration; the NLE condition is substantially more informative per unit of confidence.

Table 41: E3 empirical calibration. The E+X curve rises monotonically (24% at conf 3 $\to$ 71% at conf 5); the X-only curve is flat from conf 4 to conf 5.

Table 42: E3 mean confidence split by correctness. In both conditions, correct judgments carry higher confidence than incorrect ones, in line with the positive calibration result.
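
For a binary outcome, one common form of Somers’ $D$ compares every correct judgment with every incorrect one and counts whether the correct judgment carries the higher confidence. The sketch below implements that variant, which equals $2\cdot\text{AUC}-1$; that this is the exact variant used for the calibration tables is an assumption, and the toy data are hypothetical.

```python
def somers_d(confidence, correct):
    """Somers' D of confidence against a binary outcome (equals 2*AUC - 1).

    Normalised over all (correct, incorrect) pairs; pairs tied on
    confidence contribute zero.
    """
    pos = [c for c, y in zip(confidence, correct) if y == 1]
    neg = [c for c, y in zip(confidence, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect judgments")
    concordant = sum(1 for p in pos for n in neg if p > n)
    discordant = sum(1 for p in pos for n in neg if p < n)
    return (concordant - discordant) / (len(pos) * len(neg))


# Hypothetical judgments: stated confidence (1-5) and correctness (0/1).
conf = [3, 4, 4, 5, 5, 3, 4, 5]
hit = [0, 0, 1, 1, 1, 0, 1, 0]

d_value = somers_d(conf, hit)
print(f"Somers' D = {d_value:+.2f}")
```

Positive values mean higher confidence goes with correct answers; negative values indicate inverted calibration.
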
#### Diagnostics

Convergence: OK. Singularity: OK. DHARMa uniformity $p=.927$. Dispersion $p=.928$. No violations.

#### Bayesian ROPE

Table 43: E3 Bayesian posterior ORs. The Condition X posterior’s CrI spans 1; 27.9% of its mass is inside the ROPE, consistent with a small, non-significant negative effect on accuracy relative to E+X.
#### Sensitivity

#### Judge\-specific\.

Stats-plan item: refit the primary GLMM separately per judge to verify that the condition effect is stable across raters. Per-judge model: $\text{correct}\sim\text{condition}+(1\mid\text{instance})$.

Table 44: E3 judge-specific condition ORs. Neither judge’s effect is individually significant, and the two judges agree in direction (SHAP-only slightly reduces the odds of a correct direction prediction relative to NLE).
#### Generator and same\-family\.

Fitted on NLE rows ($N=240$):

Table 45: E3 generator and same-family tests. No generator effect; no significant same-family bias.
#### Random slopes\.

Singular fit; intercept\-only model retained\.

### D\.4Experiment 4: Mental Model Transfer

#### Descriptives

Table 46: E4 descriptives across the 55 sliding-window test positions. True-bucket distribution: small 72, medium 150, large 90, very-large 18 judgments. By judge (accuracy / mean confidence): Baseline/DeepSeek .309/3.40, Baseline/GPT-4o .418/3.75; E/DeepSeek .400/3.40, E/GPT-4o .409/3.75.
#### GLMM Fixed Effects \(Accuracy\)

Table 47: E4 GLMM fixed effects. Reference: Baseline, DeepSeek-R1. Omnibus LRT: $\chi^{2}(2)=4.86$, $p=.089$. $N=330$, 55 instance groups. The extreme random-intercept variance reflects large between-instance difficulty variation specific to this experiment’s sliding-window setup. The judge $\times$ condition interaction is marginal ($p=.089$), indicating opposing patterns across judges.
#### Estimated Marginal Means \(Accuracy\)

Table 48: E4 EMMs on the probability scale. Standard errors and confidence intervals are not reported because the extreme instance-level random intercept (SD $=5.36$, variance $=28.72$) makes them uninformative at these near-floor probability levels. The judge-specific EMMs reveal the key asymmetry: DeepSeek-R1’s accuracy improves from 3.3% to 13.0% with NLE training, whereas GPT-4o’s decreases slightly (16.5% $\to$ 14.7%).
#### CLMM \(Confidence\)

Table 49: E4 CLMM on confidence. The condition effect is virtually zero ($p=.96$); the only meaningful coefficient is judge (GPT-4o reports systematically higher confidence). Training-phase NLE exposure has no residual effect on confidence reported at test time.
#### Calibration

Table 50: E4 calibration model. The significant negative main confidence_num slope, present in both conditions per Somers’ $D$ in Table 51, indicates that higher stated confidence actually predicts *lower* accuracy. Training-phase NLE exposure does not change this (interaction $p=.46$).

Table 51: E4 Somers’ $D$. Both values are negative, i.e. calibration is inverted: higher stated confidence predicts *lower* accuracy in both conditions.

Table 52: E4 empirical calibration. Both conditions show declining accuracy from conf 3 to conf 4, the inversion captured by the negative Somers’ $D$.

Table 53: E4 mean confidence split by correctness. Incorrect judgments carry *higher* confidence than correct ones, which is again the inversion.
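
The two descriptive calibration views used here (accuracy at each stated confidence level, and mean confidence split by correctness) can be computed directly from judgment-level data. The following is a minimal sketch on hypothetical data constructed to show an inverted pattern; it does not reproduce the E4 values.

```python
from collections import defaultdict


def empirical_calibration(confidence, correct):
    """Map each stated confidence level to (accuracy, number of judgments)."""
    buckets = defaultdict(list)
    for c, y in zip(confidence, correct):
        buckets[c].append(y)
    return {level: (sum(ys) / len(ys), len(ys)) for level, ys in sorted(buckets.items())}


def mean_confidence_by_correctness(confidence, correct):
    """Mean stated confidence for correct (1) and incorrect (0) judgments."""
    groups = defaultdict(list)
    for c, y in zip(confidence, correct):
        groups[y].append(c)
    return {y: sum(cs) / len(cs) for y, cs in groups.items()}


# Hypothetical judgments with inverted calibration (illustration only).
conf = [3, 3, 3, 4, 4, 4, 4]
hit = [1, 1, 0, 0, 0, 1, 0]

curve = empirical_calibration(conf, hit)
means = mean_confidence_by_correctness(conf, hit)
print(curve)
print(means)
```

In the toy data, accuracy falls as confidence rises and incorrect judgments carry the higher mean confidence, which is the signature the captions describe.
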
#### Diagnostics

Convergence: OK. Singularity: OK. DHARMa uniformity $p=.152$. Dispersion $p=.504$. No violations.

#### Bayesian ROPE

Table 54: E4 Bayesian posterior ORs. Condition E’s posterior favours a positive effect but with wide uncertainty (CrI $[0.74, 4.56]$ includes 1); only 12.8% of the posterior is inside the ROPE, so the Bayesian result is also indeterminate.
#### Sensitivity

#### Judge\-specific\.

Stats-plan item: refit the primary GLMM separately per judge to verify that the training-phase NLE effect is stable across raters. Per-judge model: $\text{correct}\sim\text{condition}+(1\mid\text{instance})$.

Table 55: E4 judge-specific condition ORs. The two judges disagree in direction: DeepSeek-R1’s extreme estimate reflects near-separation (its Baseline accuracy is close to zero), while GPT-4o’s null is stable. This asymmetry drives the marginal ($p=.089$) judge $\times$ condition interaction in the primary model.
#### Generator and same\-family\.

Fitted on NLE rows ($N=220$):

Table 56: E4 generator and same-family tests. Neither effect is significant.
#### Random slopes\.

Singular fit; intercept\-only model retained\.

#### Trial\-order covariate \(learning\-over\-time check\)\.

Stats\-plan item: test whether judges improve with more experience\.

Table 57: E4 trial-order test. No main effect of position in the sequence; the marginal interaction ($p=.076$) hints at differential learning across trials by condition, though the effect is not robust.
#### Sliding\-window caveat\.

The E4 design uses a sliding window with 80% overlap between adjacent training sets; this induces temporal autocorrelation in the judgments beyond what the instance random intercept absorbs\. Inferential claims for this experiment carry this caveat\.
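
To make the overlap concrete, the sketch below builds sliding windows and measures how much of one window reappears in later ones. The series length, window size, and step are hypothetical values chosen only so that adjacent windows share 80% of their items and 55 positions result; they are not taken from the E4 setup.

```python
def sliding_windows(n_items, window, step):
    """Index ranges of all full windows over n_items ordered observations."""
    return [range(start, start + window) for start in range(0, n_items - window + 1, step)]


def overlap(a, b):
    """Share of window a's items that also appear in window b."""
    return len(set(a) & set(b)) / len(a)


# Hypothetical sizes (assumption): 59 observations, window of 5, step of 1.
windows = sliding_windows(n_items=59, window=5, step=1)

# Overlap between the first window and the windows 0..5 positions later.
lagged_overlap = [overlap(windows[0], windows[k]) for k in range(6)]
print(len(windows), lagged_overlap)
```

Under these assumptions the shared content decays only gradually with distance, so judgments at nearby positions are not independent even after conditioning on the instance.
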

### D\.5Experiment 5: Selective Reliance

#### Descriptives

Table 58: E5 descriptives. Accuracy by judge: Baseline/No NLE DeepSeek .900, GPT-4o 1.000; Baseline/NLE DeepSeek .967, GPT-4o 1.000; OOD/No NLE DeepSeek .550, GPT-4o .050; OOD/NLE DeepSeek .292, GPT-4o .008. The OOD $\times$ judge asymmetry is extreme: GPT-4o almost never flags OOD predictions as unreliable.
#### GLMM Fixed Effects \(Accuracy\)

Table 59: E5 GLMM fixed effects. Reference: Baseline, No NLE, DeepSeek-R1. Two-way interaction LRT: $\chi^{2}(1)=9.84$, $p=.002$. Three-way with judge: $\chi^{2}(3)=58.78$, $p=1.07\times 10^{-12}$ (full-model AIC $=341.8$ vs two-way AIC $=394.6$); judge asymmetry is pronounced. $N=720$.
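
The reported likelihood-ratio statistics can be cross-checked against their $p$-values and against the AIC difference. The sketch assumes the usual definition AIC $=-2\log L+2k$, so that adding $k$ parameters changes AIC by $\chi^{2}-2k$, and it takes the three degrees of freedom of the LRT as the number of added parameters. The chi-square tail probabilities use closed forms that hold only for 1, 2, or 3 degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed forms, df in {1, 2, 3})."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1, 2, 3 are implemented in this sketch")


# Statistics as reported for the E5 model comparisons.
p_two_way = chi2_sf(9.84, df=1)
p_three_way = chi2_sf(58.78, df=3)

# AIC improvement implied by the LRT versus the reported AIC values.
delta_aic_from_lrt = 58.78 - 2 * 3
delta_aic_reported = 394.6 - 341.8

print(f"two-way p   = {p_two_way:.4f}")
print(f"three-way p = {p_three_way:.3g}")
print(f"delta AIC: {delta_aic_from_lrt:.2f} (from LRT) vs {delta_aic_reported:.1f} (reported)")
```

The two AIC differences agree up to rounding, which supports reading the three-way test as a comparison of nested models that differ by three parameters.
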
#### Simple Effects

Table 60: E5 simple-effects decomposition of the significant poisoning $\times$ NLE interaction. The interaction consists of the NLE effect reversing direction between Baseline and OOD, from help to harm.
#### CLMM \(Confidence\)

Table 61: E5 CLMM on confidence. Poisoning decreases confidence slightly (the OOD prediction is rated slightly less confidently), but the NLE main effect and interaction on confidence are non-significant: confidence is essentially uniform (3.82–3.99) across all four conditions. The poisoning $\times$ NLE interaction LRT on confidence is $\chi^{2}(1)=0.52$, $p=.47$.
#### Calibration

| Effect | Estimate | SE | $z$ | $p$ |
|---|---|---|---|---|
| Intercept | −10.891 | 5.544 | −1.96 | .049 |
| confidence_num | **+3.889** | 1.444 | **+2.69** | **.007** |
| Baseline, NLE | −10.978 | 7.310 | −1.50 | .133 |
| OOD, No NLE | **+18.447** | 6.012 | **+3.07** | **.002** |
| OOD, NLE | **+16.810** | 5.831 | **+2.88** | **.004** |
| Judge: GPT-4o | −1.928 | 0.403 | −4.79 | <.001 |
| conf $\times$ Baseline, NLE | +3.339 | 1.968 | +1.70 | .090 |
| conf $\times$ OOD, No NLE | **−5.931** | 1.560 | **−3.80** | **.0001** |
| conf $\times$ OOD, NLE | **−5.778** | 1.516 | **−3.81** | **.0001** |
| Random intercept SD | 0.68 | | | |

Table 62: E5 calibration model. In Baseline, higher confidence predicts correctness (positive main slope). In OOD, both conf $\times$ OOD interactions are large and negative: confidence is *inversely* related to accuracy when the input is poisoned. NLE presence does not change the calibration slope in OOD (both OOD conditions have near-identical interaction coefficients).

Table 63: E5 Somers’ $D$. Baseline conditions show positive calibration; OOD conditions show inverted calibration, i.e. confidence predicts being *wrong*. NLE presence does not change the sign.

Table 64: E5 empirical calibration. In OOD conditions, stated confidence 4 corresponds to only 10–17% accuracy, while confidence 3 corresponds to 58–84% accuracy, a strong inversion. NLE makes this slightly worse (NLE-conf-4 has 9.8% accuracy versus No-NLE-conf-4 at 17.3%).

Table 65: E5 mean confidence split by correctness. In OOD, confidence when *wrong* exceeds confidence when *right*, which is the diagnostic fingerprint of the rationalisation mechanism.
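
The condition-specific calibration slopes implied by the E5 calibration model can be obtained by adding each interaction coefficient to the main confidence slope. The sketch assumes treatment coding with “Baseline, No NLE” as the reference level (inferred from its absence among the coefficients); the numbers are copied from the coefficient table, and the derived slopes are not reported in the paper.

```python
import math

# Main confidence slope (log-odds of a correct judgment per confidence point).
main_slope = 3.889

# Confidence-by-condition interaction coefficients from the E5 calibration model.
interaction = {
    "Baseline, No NLE": 0.0,  # assumed reference level (not listed in the table)
    "Baseline, NLE": 3.339,
    "OOD, No NLE": -5.931,
    "OOD, NLE": -5.778,
}

# Effective slope per condition and the matching odds ratio per confidence point.
slopes = {cond: main_slope + delta for cond, delta in interaction.items()}
odds_ratio_per_point = {cond: math.exp(s) for cond, s in slopes.items()}

for cond in slopes:
    print(f"{cond:18s} slope {slopes[cond]:+.3f}  OR per point {odds_ratio_per_point[cond]:.3f}")
```

Under this reading, both OOD slopes are negative and close to each other, matching the statement that NLE presence leaves the inverted OOD calibration essentially unchanged.
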
#### Diagnostics

Convergence: OK. Singularity: OK. DHARMa uniformity $p=.341$. Dispersion $p=.368$. No violations.

#### Bayesian ROPE

Table 66: E5 Bayesian posterior ORs. The OOD $\times$ NLE interaction is the only Part 2 effect whose Bayesian CrI excludes 1: positive evidence for a genuine interaction, in the direction of harm (NLE makes OOD detection worse).
#### Sensitivity

#### Judge\-specific\.

Stats-plan item: refit the primary GLMM separately per judge to verify that the poisoning $\times$ NLE interaction is stable across raters. Per-judge model: $\text{correct}\sim\text{poisoning}\times\text{has\_nle}+(1\mid\text{instance})$.

Table 67: E5 judge-specific ORs. GPT-4o’s fit is effectively separated (GPT-4o almost never flags OOD: 0.8–5% accuracy), so its coefficients are unreliable in magnitude but directionally consistent with DeepSeek-R1: both judges show the same interaction direction (NLE roughly halves OOD detection).
#### Generator\.

Not testable – splitting by poisoning produces too few NLE\-condition rows with balanced generators after filtering\.

#### Random slopes\.

Singular fit; intercept\-only model retained\. The primary model’s random\-intercept variance is already near the boundary \(the descriptives show Baseline accuracy essentially at ceiling\), so richer random structures cannot be identified from this data\.
