Saying More Than They Know: A Framework for Quantifying Epistemic-Rhetorical Miscalibration in Large Language Models
Summary
Introduces a framework to quantify how LLMs overstate certainty through rhetorical devices, revealing model-agnostic patterns of epistemic-rhetorical miscalibration.
Cached at: 04/23/26, 10:02 AM
# Saying More Than They Know: A Framework for Quantifying Epistemic-Rhetorical Miscalibration in Large Language Models
Source: [https://arxiv.org/html/2604.19768](https://arxiv.org/html/2604.19768)
[Asim D. Bakhshi](https://orcid.org/0000-0002-9516-9153), National University of Science and Technology, Islamabad, Pakistan 46000, asim.dilawar@mcs.edu.pk
###### Abstract
Large language models (LLMs) exhibit systematic miscalibration, with rhetorical intensity that is not proportionate to epistemic grounding. This study tests this hypothesis and proposes a framework for quantifying the decoupling through a triadic epistemic-rhetorical marker (ERM) taxonomy. The taxonomy is operationalized through composite metrics of form-meaning divergence (FMD), genuine-to-performed epistemic ratio (GPR), and rhetorical device distribution entropy (RDDE). Applied to 225 argumentative texts spanning approximately 0.6 million tokens across human expert, human non-expert, and LLM-generated sub-corpora, the framework identifies a consistent, model-agnostic LLM epistemic signature. LLM-generated texts produce tricolon at nearly twice the expert rate ($\Delta = 0.95$), while human authors produce erotema at more than twice the LLM rate. Performed hesitancy markers appear at twice the human density in LLM output. FMD is significantly elevated in LLM texts relative to both human groups ($p < 0.001$, $\Delta = 0.68$), and rhetorical devices are distributed significantly more uniformly across LLM documents. The findings are consistent with theoretical intuitions derived from Gricean pragmatics, Relevance Theory, and Brandomian inferentialism. The annotation pipeline is fully automatable, making it deployable as a lightweight screening tool for epistemic miscalibration in AI-generated content and as a theoretically motivated feature set for LLM-generated text detection pipelines.
*Keywords* Large Language Models · AI Evaluation · Bias and Fairness · Epistemic Uncertainty
## 1 Introduction
Large language model (LLM) bias is a multilayered phenomenon that manifests in distinct linguistic and social dimensions Gallegos et al. ([2024](https://arxiv.org/html/2604.19768#bib.bib1)); Ranjan et al. ([2024](https://arxiv.org/html/2604.19768#bib.bib5)). The existing literature predominantly addresses three critical categories: demographic and representational bias involving social hierarchies Blodgett et al. ([2020](https://arxiv.org/html/2604.19768#bib.bib36)), factual inaccuracy often termed hallucination Ji et al. ([2023](https://arxiv.org/html/2604.19768#bib.bib37)), and the propagation of toxicity Bender et al. ([2021](https://arxiv.org/html/2604.19768#bib.bib34)). Although these concerns are vital for model safety, they focus primarily on semantic content, i.e., the surface representation layer of LLM responses. What remains largely unexamined are the structural metacognitive mechanisms by which models position their claims Erhardt ([2025](https://arxiv.org/html/2604.19768#bib.bib28)). Beneath explicit content lies a secondary, more subtle layer of structural bias related to the epistemic and rhetorical posture of generated text Li et al. ([2025](https://arxiv.org/html/2604.19768#bib.bib27)).
This deeper layer concerns the epistemic stance of a text, a concept rooted in linguistic modality that reflects a speaker’s degree of commitment to propositional content Fintel and Gillies ([2007](https://arxiv.org/html/2604.19768#bib.bib6)); Li and Zhang ([2025](https://arxiv.org/html/2604.19768#bib.bib23)). Epistemic stance is a load-bearing feature of language that differentiates between settled knowledge, indirect inference, and genuine uncertainty Lee et al. ([2025](https://arxiv.org/html/2604.19768#bib.bib10)); Li et al. ([2025](https://arxiv.org/html/2604.19768#bib.bib27)). Rational communication relies on this calibration to signal the reliability of evidence and the source of information. If LLMs systematically miscalibrate these markers relative to the actual epistemic status of their claims, they introduce a structural bias that distorts how users assess and trust generated content.
Recent rhetorical analyses have identified that LLMs produce persuasive surface structures, and epistemic stance marking has been studied as an isolated natural language processing (NLP) task Clausen ([2010](https://arxiv.org/html/2604.19768#bib.bib30)); Kramer ([2025](https://arxiv.org/html/2604.19768#bib.bib12)). To the best of our knowledge, however, no existing work has examined the relationship between rhetorical intensity and epistemic calibration as a unified, measurable construct Erhardt ([2025](https://arxiv.org/html/2604.19768#bib.bib28)). This gap manifests as a decoupling in which LLMs deploy elaborate rhetorical forms, such as tricolon, correctio, and contrastive reframing, independently of whether the claim warrants such intensity or whether epistemic markers represent genuine uncertainty or merely performed hesitancy. The precise divergence between rhetorical form and epistemic import remains unmeasured.
To address this gap, we propose an epistemic-rhetorical marker (ERM) taxonomy, a triadic framework distinguishing sentence-level rhetorical devices, epistemic stance markers, and discourse-level argumentative structure. The framework operationalizes metrics that quantify the decoupling between rhetorical intensity and epistemic calibration, validated through a corpus comprising human expert, human non-expert, and LLM-generated writing. The approach is grounded in Gricean pragmatics Grice ([1975](https://arxiv.org/html/2604.19768#bib.bib38)), Relevance Theory Wilson and Sperber ([2002](https://arxiv.org/html/2604.19768#bib.bib39)), and Brandomian inferentialism Brandom ([1994](https://arxiv.org/html/2604.19768#bib.bib40), [1997](https://arxiv.org/html/2604.19768#bib.bib41)).
This study makes three primary contributions. First, we present a theoretically grounded ERM taxonomy integrating rhetorical device analysis with epistemic modality classification across three levels of linguistic organisation: sentence-level tropes, lexical and syntactic stance markers, and discourse-level argumentative structure. Second, we introduce three novel, corpus-applicable metrics for measuring epistemic-rhetorical divergence in argumentative text: a form-meaning divergence (FMD) score, a genuine-to-performed epistemic ratio (GPR), and rhetorical device distribution entropy (RDDE). Third, we provide corpus-wide empirical validation showing that LLM output exhibits systematically higher divergence and a distinct epistemic marker profile. Together, these contributions disentangle a dimension of LLM bias that has implications for AI-generated text evaluation and the computational study of epistemic stance.
## 2 Related Work
Rhetorical analysis, stylometric measurement, and epistemic marker classification have developed largely in isolation\. Their structural relationship remains unmeasured\. The current literature falls into three relevant domains\.
### 2.1 Bias in LLMs: Existing Taxonomies and Their Limits
LLM bias research is predominantly structured around the evaluation of propositional content and surface-level identifiers Gallegos et al. ([2024](https://arxiv.org/html/2604.19768#bib.bib1)); Ranjan et al. ([2024](https://arxiv.org/html/2604.19768#bib.bib5)); Blodgett et al. ([2020](https://arxiv.org/html/2604.19768#bib.bib36)). Initial research focused heavily on demographic and representational biases, categorising harms into allocational disparities and representational harms where specific social groups are stereotyped or devalued Raj et al. ([2024](https://arxiv.org/html/2604.19768#bib.bib42)); Raza et al. ([2025](https://arxiv.org/html/2604.19768#bib.bib2)). In parallel, a robust body of literature addresses factual bias and hallucination Ji et al. ([2023](https://arxiv.org/html/2604.19768#bib.bib37)); Kenthapadi et al. ([2024](https://arxiv.org/html/2604.19768#bib.bib44)). Significant effort has also been directed toward toxicity and harmful output, developing metrics and moderation tools to identify offensive or biased language targeting marginalised communities Guo et al. ([2024](https://arxiv.org/html/2604.19768#bib.bib4)); Bender et al. ([2021](https://arxiv.org/html/2604.19768#bib.bib34)); McKee and Porter ([2020](https://arxiv.org/html/2604.19768#bib.bib13)). Despite their critical importance, these taxonomies do not analyse the structural relationship between rhetorical form and epistemic stance.
### 2.2 Computational Approaches to Rhetoric and Style
Computational stylistics establishes that individual writing style is a measurable fingerprint detectable through quantitative analysis Neal et al. ([2017](https://arxiv.org/html/2604.19768#bib.bib45)). Stylometry leverages metrics such as Burrows’ Delta to measure the distribution of most frequent words, providing a content-independent authorship signature Burrows ([2002](https://arxiv.org/html/2604.19768#bib.bib46)); Jannidis et al. ([2015](https://arxiv.org/html/2604.19768#bib.bib47)). This tradition has informed recent work using lexical and syntactic features to distinguish human-written from LLM-generated text across various models Agrahari et al. ([2025](https://arxiv.org/html/2604.19768#bib.bib19)); Bisztray et al. ([2026](https://arxiv.org/html/2604.19768#bib.bib15)); Kumarage and Liu ([2023](https://arxiv.org/html/2604.19768#bib.bib17)); Zaitsu et al. ([2025](https://arxiv.org/html/2604.19768#bib.bib21)). Computational rhetorical analysis has sought to map discourse structure through frameworks such as Rhetorical Structure Theory and the automated detection of complex figures and tropes Majdik and Graham ([2024](https://arxiv.org/html/2604.19768#bib.bib11)); Erhardt ([2025](https://arxiv.org/html/2604.19768#bib.bib28)). These stylometric tools have more recently been applied to AI-generated text detection, identifying distinctive patterns such as reduced lexical diversity and increased structural uniformity relative to human expert writing Aityan et al. ([2026](https://arxiv.org/html/2604.19768#bib.bib16)); Al-Shaibani and Ahmed ([2026](https://arxiv.org/html/2604.19768#bib.bib18)).
### 2.3 Epistemic Modality and Stance in NLP
Computational work on hedging detection has provided a foundation for identifying expressions of uncertainty, particularly within scientific literature Clausen ([2010](https://arxiv.org/html/2604.19768#bib.bib30)); Medlock and Briscoe ([2007](https://arxiv.org/html/2604.19768#bib.bib32)). Hedge cues have been shown to be high-precision markers of uncertainty, though their detection remains highly domain-dependent Szarvas ([2008](https://arxiv.org/html/2604.19768#bib.bib31)); Li and Zhang ([2025](https://arxiv.org/html/2604.19768#bib.bib23)). In the context of LLMs, epistemic stance classification has evolved into the study of honesty alignment, where models are trained to verbalise confidence levels and express uncertainty explicitly Clark et al. ([2025](https://arxiv.org/html/2604.19768#bib.bib8)); Lee et al. ([2025](https://arxiv.org/html/2604.19768#bib.bib10)); Tao et al. ([2025](https://arxiv.org/html/2604.19768#bib.bib7)). No existing work correlates the presence of these markers with the rhetorical elaborateness of the surrounding text. Moreover, performed hesitancy is not distinguished from genuine epistemic marking as a functional category.
## 3 Theoretical Framework
### 3.1 Foundational Premises
We hypothesize that the decoupling of rhetorical form from epistemic grounding is a measurable property of text\. Three theoretical traditions, i\.e\., Gricean pragmatics, Relevance Theory, and Brandomian inferentialism, provide a unified account of why this relationship is normatively governed and why its disruption leaves a recoverable structural signal\.
#### 3.1.1 Gricean Cooperative Principle
Gricean pragmatics holds that speakers in a communicative exchange are expected to observe implicit maxims governing the quantity, quality, relation, and manner of their contributions Grice ([1975](https://arxiv.org/html/2604.19768#bib.bib38)). The maxims of quantity and quality together establish a normative expectation that the *weight* of an assertion, that is, the degree of confidence and elaborateness with which it is presented, should be proportionate to the speaker’s actual epistemic position. A rhetorically elaborate claim that is evidentially thin violates the normative structure that makes assertion a cooperative act. Consider the following pair:
> (1) “Some studies suggest that sleep deprivation may impair consolidation of declarative memory.”
>
> (2) “Without question, sleep deprivation impairs attention, corrupts consolidation, and destroys the brain’s capacity to retain what it has learned across every domain of cognitive function, with consequences that are nothing short of profound.”
Claim \(1\) deploys modal hedging \(may\), an evidential restrictor \(some studies suggest\), and domain qualification \(declarative memory\); its rhetorical weight is proportionate to the epistemic position it reports\. Claim \(2\) deploys tricolon, categorical assertion \(without question\), and auxesis \(nothing short of profound\)\. Its rhetorical weight vastly exceeds any single body of evidence that could license it\. A reader encountering Claim \(2\) is misled not by the content but by the assertoric form in which it is packaged\.
#### 3.1.2 Relevance Theory
Relevance Theory grounds the Gricean intuition in a cognitive account of comprehension Wilson and Sperber ([2002](https://arxiv.org/html/2604.19768#bib.bib39)). It proposes that communicative behaviour is governed by a presumption of optimal relevance: when a speaker produces an utterance requiring interpretive effort, that effort creates a corresponding expectation of proportionate cognitive effects. Rhetorical devices such as tricolon or auxesis impose structural complexity on the reader. Relevance Theory predicts this cost is licensed only when the content warrants it. Consider a passage from an example LLM-generated policy report produced in response to a prompt:
> (3) “Across the full spectrum of deployment contexts, from hiring to healthcare, from credit scoring to criminal justice, and from content moderation to educational assessment, algorithmic systems raise profound, multi-dimensional, and deeply interconnected questions about equity.”

A reader is expected to track a six-item enumeration, resolve three parallel prepositional phrases, and integrate the import of *profound*, *multi-dimensional*, and *deeply interconnected*. What the sentence delivers is the proposition that algorithmic systems raise questions about equity, a claim any reader already accepts. The cognitive yield is arguably close to zero relative to the processing cost, and this gap is computationally measurable.
#### 3.1.3 Brandomian Inferentialism
Brandom’s account of discursive practice treats a speaker’s assertion as the undertaking of a social and normative commitment Brandom ([1994](https://arxiv.org/html/2604.19768#bib.bib40), [1997](https://arxiv.org/html/2604.19768#bib.bib41)). The *strength* of that commitment is a normatively governed function of the *entitlements* the speaker actually possesses. When assertoric strength exceeds the entitlements backing it, as in Claim (2), a reader is licensed to treat the overclaimed proposition as a warranted premise for subsequent reasoning, concluding that the effect size is large, the evidence settled, and the claim universal. The speaker has propagated an epistemic deficit into inference chains they will never be held accountable for. It is this downstream liability that gives the form-meaning decoupling its normative weight and motivates the divergence score as an instrument for detecting structural bias rather than merely poor style.
### 3.2 The ERM Taxonomy
The ERM taxonomy is proposed to operationalize rhetorical form and epistemic grounding as independently annotatable properties of text across three separable levels of linguistic organization. Its architecture, theoretical anchors, and derived metrics are illustrated in Figure [1](https://arxiv.org/html/2604.19768#S3.F1). The complete inventory is summarized in Table [1](https://arxiv.org/html/2604.19768#S3.T1), with full definitions and annotated examples in Appendix [A](https://arxiv.org/html/2604.19768#A1).

Figure 1: Architecture of the ERM taxonomy showing linkages between the three theoretical anchors (left), the six taxonomy levels (centre), and the four types of quantitative metrics (right).

Table 1: Overview of the ERM taxonomy; full definitions and annotated examples are in Appendix [A](https://arxiv.org/html/2604.19768#A1).

#### 3.2.1 Level 1 – Rhetorical Devices
Level 1 captures formal presentational structures that modulate expressive intensity, i.e., the *how* rather than the *what* of a claim. Grounded in the Gricean quantity maxim and Relevance Theory’s processing-cost principle, it provides the rhetorical intensity component for computing FMD and RDDE. Ten devices are organized across three scales of operation (Tables [4](https://arxiv.org/html/2604.19768#A1.T4)–[6](https://arxiv.org/html/2604.19768#A1.T6)).
#### 3.2.2 Level 2 – Epistemic Stance Markers
Level 2 classifies the lexical and syntactic devices by which a text encodes its assertoric commitment. The central distinction is between *genuine epistemic markers* (2a), which ground a claim in an identifiable evidential base Fintel and Gillies ([2007](https://arxiv.org/html/2604.19768#bib.bib6)), and *performed hesitancy* (2b), which adopts the surface register of uncertainty without its formal apparatus. The two sub-levels feed FMD and GPR differently (see Section [4](https://arxiv.org/html/2604.19768#S4)), an opposition that operationalizes the Brandomian distinction between inferential entitlement and its simulation Brandom ([1997](https://arxiv.org/html/2604.19768#bib.bib41)).
#### 3.2.3 Level 3 – Discourse-level Argumentative Structure
Level 3 characterizes the global argumentative endpoint and inferential trajectory of a text, distinguishing four markers as defined in Table [9](https://arxiv.org/html/2604.19768#A1.T9). These are annotated at document level and reported as sub-corpus proportions.
## 4 Methodological Implementation
The proposed methodological pipeline, illustrated in Figure [2](https://arxiv.org/html/2604.19768#S4.F2), comprises three stages: corpus construction and segmentation, annotation, and ERM feature engineering.¹

¹ All annotation prompts, corpus metadata, and analysis scripts will be open-sourced (post-publication) to support reproducibility and replication across extended and more diverse corpora.

Figure 2: The ERM architecture pipeline. Stage 1 constructs the corpus and segments each document into sentence sequences and Toulmin semantic chunks. Stage 2 applies the ERM taxonomy across five parallel annotation passes covering rhetorical devices at three scales, genuine and performed epistemic markers, and closure markers. Stage 3 transforms the annotation output into three composite metrics (FMD, GPR, and RDDE) and four discourse-level structural proportions.

### 4.1 Stage 1: Corpus Construction and Segmentation
The corpus comprises three sub\-corpora corresponding to the three author types under investigation\. Let the full corpus be defined as
$$\mathcal{C} = \{\mathcal{C}_{E},\; \mathcal{C}_{NE},\; \mathcal{C}_{LLM}\} \tag{1}$$

where $\mathcal{C}_{E}$, $\mathcal{C}_{NE}$, and $\mathcal{C}_{LLM}$ denote the human expert, human non-expert, and LLM-generated sub-corpora respectively. Each sub-corpus contains $|\mathcal{C}_{k}| = 75$ documents:

$$\mathcal{C}_{k} = \{d_{1}, d_{2}, \ldots, d_{75}\}, \quad k \in \{E, NE, LLM\} \tag{2}$$

yielding a total corpus of 225 documents. The sub-corpora are summarized in Table [2](https://arxiv.org/html/2604.19768#S4.T2).
Table 2: Corpus summary ($n = 75$ per sub-corpus, 225 total). A pre-November 2022 cutoff is applied to HE and HN to exclude texts potentially assisted by generative AI tools. Each LG text was generated using a uniform prompt template comprising an open-ended argumentative question derived from the corresponding HE text’s central topic, a 1500-word length target, and fixed structural instructions to take a clear position, acknowledge a counterargument, and engage honestly with uncertainty.

#### 4.1.1 Segmentation
Each document $d$ is segmented into two distinct representational units serving different levels of ERM annotation, applied uniformly across all three sub-corpora.

For Levels 1 and 2, each document $d$ is represented as an ordered sequence of sentences produced by spaCy sentence boundary detection Honnibal et al. ([2020](https://arxiv.org/html/2604.19768#bib.bib48)):

$$d = \langle s_{1}, s_{2}, \ldots, s_{m} \rangle \tag{3}$$

Rhetorical devices and epistemic stance markers are annotated at this level.

For Level 3, each document $d$ is represented as an ordered sequence of argumentative chunks:

$$d = \langle c_{1}, c_{2}, \ldots, c_{p} \rangle \tag{4}$$

where each chunk $c_{k}$ is a contiguous span of one or more sentences assigned a single argumentative function label drawn from a seven-way typology adapted from Toulmin’s model of argumentation Toulmin ([2003](https://arxiv.org/html/2604.19768#bib.bib49)):

$$\tau(c_{k}) \in \{\textsc{Claim},\, \textsc{Grounds},\, \textsc{Warrant},\, \textsc{Backing},\, \textsc{Qualifier},\, \textsc{Rebuttal},\, \textsc{Non-Argumentative}\} \tag{5}$$

Non-Argumentative is assigned to transitional, expository, definitional, or illustrative spans that carry no direct argumentative function, following the precedent established in computational argumentation mining Stab and Gurevych ([2017](https://arxiv.org/html/2604.19768#bib.bib50)). Semantic chunks are non-overlapping and jointly exhaustive over the document:

$$\bigcup_{k=1}^{p} c_{k} = d, \qquad c_{k} \cap c_{k'} = \emptyset \;\; \text{for} \;\; k \neq k' \tag{6}$$

Chunk boundaries are identified using a locally hosted LLM to segment each document and return structured span indices with type labels. Non-Argumentative chunks are excluded from Level 3 annotation; the remaining chunks constitute the argumentative skeleton of the document used to contextualise endpoint and trajectory judgements.

The complete segmentation yields two parallel representations for each document $d \in \mathcal{C}$:

$$\mathcal{R}(d) = \bigl(\langle s_{1}, \ldots, s_{m} \rangle,\; \langle c_{1}, \ldots, c_{p} \rangle\bigr) \tag{7}$$
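The partition constraint in Eq. (6) can be checked mechanically once chunks are stored as sentence-index spans. The following is a minimal sketch under that assumption; the names `ToulminLabel`, `Chunk`, `is_valid_partition`, and `argumentative_skeleton`, as well as the half-open `[start, end)` span layout, are illustrative and are not taken from the released pipeline.

```python
from dataclasses import dataclass
from enum import Enum


class ToulminLabel(Enum):
    """Seven-way chunk typology of Eq. (5)."""
    CLAIM = "Claim"
    GROUNDS = "Grounds"
    WARRANT = "Warrant"
    BACKING = "Backing"
    QUALIFIER = "Qualifier"
    REBUTTAL = "Rebuttal"
    NON_ARGUMENTATIVE = "Non-Argumentative"


@dataclass(frozen=True)
class Chunk:
    """Contiguous sentence span [start, end) with one label (assumed layout)."""
    start: int  # index of the first sentence in the chunk
    end: int    # index one past the last sentence in the chunk
    label: ToulminLabel


def is_valid_partition(chunks: list[Chunk], num_sentences: int) -> bool:
    """Eq. (6): chunks are non-empty, non-overlapping, and jointly exhaustive."""
    cursor = 0
    for chunk in sorted(chunks, key=lambda c: c.start):
        if chunk.start != cursor or chunk.end <= chunk.start:
            return False  # gap, overlap, or empty span
        cursor = chunk.end
    return cursor == num_sentences


def argumentative_skeleton(chunks: list[Chunk]) -> list[Chunk]:
    """Drop Non-Argumentative chunks, as done before Level 3 annotation."""
    return [c for c in chunks if c.label is not ToulminLabel.NON_ARGUMENTATIVE]


# Toy document with m = 5 sentences and p = 3 chunks (made-up spans).
toy_chunks = [
    Chunk(0, 2, ToulminLabel.CLAIM),
    Chunk(2, 3, ToulminLabel.NON_ARGUMENTATIVE),
    Chunk(3, 5, ToulminLabel.GROUNDS),
]
```

A check of this kind is useful when chunk boundaries come from an LLM segmenter, since returned span indices may overlap or leave gaps.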
### 4.2 Stage 2: Annotation
For each document $d$, the annotator processes each text unit in a separate pass and returns a binary judgement for each marker in the corresponding taxonomy level, producing the structured annotation vector:

$$\phi(d) = \bigl(\lambda_{1a}(d),\; \lambda_{1b}(d),\; \lambda_{1c}(d),\; \lambda_{2a}(d),\; \lambda_{2b}(d),\; \lambda_{3}(d)\bigr) \tag{8}$$

where $\lambda_{1a}(d)$, $\lambda_{2a}(d)$, and $\lambda_{2b}(d)$ are binary matrices of shape $m_{d} \times 5$, $m_{d} \times 4$, and $m_{d} \times 2$ respectively, recording sentence-level rhetorical devices and epistemic stance markers; $\lambda_{1b}(d)$ is a binary matrix of shape $p \times 3$ recording argument-level rhetorical devices per chunk; and $\lambda_{1c}(d)$ and $\lambda_{3}(d)$ are binary vectors of shape $1 \times 2$ and $1 \times 4$ recording narrative-level rhetorical devices and Level 3 argumentative structure markers at document level, with the Toulmin chunk structure from Stage 1 providing structured context for Level 3 judgements.
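To make the shapes in Eq. (8) concrete, the following is a minimal sketch of a container for $\phi(d)$ with a shape check. The class name `Annotation`, its field names, and the `validate` helper are illustrative assumptions; only the shapes themselves are taken from the description above.

```python
from dataclasses import dataclass

Matrix = list[list[int]]  # rows of binary (0/1) judgements


def _check(name: str, rows: Matrix, n_rows: int, n_cols: int) -> None:
    """Raise ValueError unless `rows` is an n_rows x n_cols binary matrix."""
    if len(rows) != n_rows or any(len(r) != n_cols for r in rows):
        raise ValueError(f"{name}: expected shape {n_rows}x{n_cols}")
    if any(v not in (0, 1) for r in rows for v in r):
        raise ValueError(f"{name}: entries must be 0 or 1")


@dataclass
class Annotation:
    """Container for phi(d) in Eq. (8); field names are hypothetical."""
    l1a: Matrix     # m_d x 5: sentence-level rhetorical devices
    l1b: Matrix     # p x 3: argument-level rhetorical devices per chunk
    l1c: list[int]  # 1 x 2: narrative-level rhetorical devices
    l2a: Matrix     # m_d x 4: genuine epistemic markers
    l2b: Matrix     # m_d x 2: performed hesitancy markers
    l3: list[int]   # 1 x 4: Level 3 argumentative structure markers

    def validate(self, m_d: int, p: int) -> None:
        """Check every component against the shapes stated for Eq. (8)."""
        _check("l1a", self.l1a, m_d, 5)
        _check("l1b", self.l1b, p, 3)
        _check("l1c", [self.l1c], 1, 2)
        _check("l2a", self.l2a, m_d, 4)
        _check("l2b", self.l2b, m_d, 2)
        _check("l3", [self.l3], 1, 4)


# Toy annotation for a document with m_d = 2 sentences and p = 1 chunk.
toy = Annotation(
    l1a=[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0]],
    l1b=[[0, 1, 0]],
    l1c=[0, 1],
    l2a=[[0, 0, 0, 0], [0, 0, 1, 0]],
    l2b=[[1, 0], [0, 0]],
    l3=[1, 0, 0, 0],
)
toy.validate(m_d=2, p=1)  # passes silently when all shapes agree
```

The column counts of the three Level 1 components (5, 3, and 2) sum to the ten devices of the Level 1 inventory.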
Algorithm [1](https://arxiv.org/html/2604.19768#alg1) summarises the complete ERM pipeline. Each stage is designed to be executed either by a trained human annotator or by an LLM. Human annotation is feasible, although it requires training on the ERM taxonomy, familiarity with the Toulmin argumentation paradigm, and close reading (Level 3). LLM annotation may employ a capable, reasonably sized model. The choice between human and LLM annotation trades interpretive depth for throughput, while both modalities ultimately produce the annotation vector $\phi(d)$ required by Stage 3.
**Algorithm 1** Automated ERM Pipeline

**Input:** Corpus $\mathcal{C} = \{\mathcal{C}_{E}, \mathcal{C}_{NE}, \mathcal{C}_{LLM}\}$

**Output:** Per-document features $\{\delta(d), \gamma(d), \eta_{\text{norm}}(d)\}$; sub-corpus proportions $P_{x}(\mathcal{C}_{k})$

**Stage 1: Segmentation**

- **for all** documents $d \in \mathcal{C}$ **do**
  - Detect sentence boundaries $\langle s_{1}, \ldots, s_{m} \rangle$ via spaCy
  - Segment $d$ into Toulmin chunks $\langle c_{1}, \ldots, c_{p} \rangle$ with labels $\tau(c_{k})$
- **end for**

**Stage 2: Annotation** (human or LLM annotator)

- **for all** documents $d \in \mathcal{C}$ **do**
  - **Pass 1:** annotate each $s_{i} \rightarrow \lambda_{1a}(d)$
  - **Pass 2:** annotate each $s_{i} \rightarrow \lambda_{2a}(d), \lambda_{2b}(d)$
  - **Pass 3:** annotate each $c_{k}$ not labelled Non-Argumentative $\rightarrow \lambda_{1b}(d)$
  - **Pass 4:** annotate $d \rightarrow \lambda_{1c}(d), \lambda_{3}(d)$
- **end for**

**Stage 3: Feature Engineering**

- **for all** documents $d \in \mathcal{C}$ **do**
  - Compute $\rho(d)$, $\varepsilon_{g}(d)$, $\varepsilon_{p}(d)$ from $\lambda_{1\ast}(d)$, $\lambda_{2\ast}(d)$, and $m_{d}$
  - Compute $\delta(d)$ via Eq. (11)
  - Compute $\gamma(d)$ via Eq. (12)
  - Partition $d$ into 50-word windows
  - Compute $\eta_{\text{norm}}(d)$ via Eq. (14)
  - Extract binary Level 3 indicators $\sigma(d), \alpha(d), \pi(d), \varsigma(d)$
- **end for**
- **for all** sub-corpora $\mathcal{C}_{k}$ **do**
  - Compute $P_{x}(\mathcal{C}_{k})$ for each $x \in \{\sigma, \alpha, \pi, \varsigma\}$
- **end for**

### 4.3 Stage 3: ERM Feature Engineering
The annotation matrices in $\phi(d)$ are transformed into four features, each capturing a distinct dimension of epistemic-rhetorical miscalibration. All density measures are normalized per sentence given $m_{d}$, the sentence count of document $d$.
#### 4.3.1 Form-Meaning Divergence (FMD)
In order to measure the degree to which rhetorical elaboration exceeds epistemic grounding, the total rhetorical device count is aggregated across all three Level 1 sub\-levels and normalized as
$$\rho(d) = \frac{\sum \lambda_{1a}(d) + \sum \lambda_{1b}(d) + \sum \lambda_{1c}(d)}{m_{d}} \tag{9}$$

The densities of genuine and performed epistemic markers are defined analogously as

$$\varepsilon_{g}(d) = \frac{\sum \lambda_{2a}(d)}{m_{d}}, \qquad \varepsilon_{p}(d) = \frac{\sum \lambda_{2b}(d)}{m_{d}} \tag{10}$$

Form-Meaning Divergence is then computed as

$$\delta(d) = \frac{\rho(d) \cdot \varepsilon_{p}(d)}{\varepsilon_{g}(d) + 1} \tag{11}$$

The multiplicative structure ensures that divergence is only elevated when rhetorical elaboration and epistemic miscalibration co-occur. The term $\varepsilon_{g}(d)$ in the denominator attenuates the score in proportion to the genuine epistemic grounding. The constant $+1$ prevents division by zero.
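Equations (9) to (11) translate directly into a few lines of code. The sketch below assumes each annotation component is stored as a list of 0/1 rows (with $\lambda_{1c}$ as a single-row matrix); the function names and the toy annotations are made up for illustration and carry no empirical meaning.

```python
Matrix = list[list[int]]  # rows of binary (0/1) judgements


def marker_sum(matrix: Matrix) -> int:
    """Total number of positive judgements in a binary annotation matrix."""
    return sum(sum(row) for row in matrix)


def rhetorical_density(l1a: Matrix, l1b: Matrix, l1c: Matrix, m_d: int) -> float:
    """Eq. (9): Level 1 device count over all three sub-levels, per sentence."""
    return (marker_sum(l1a) + marker_sum(l1b) + marker_sum(l1c)) / m_d


def marker_density(l2: Matrix, m_d: int) -> float:
    """Eq. (10): epistemic marker count per sentence (genuine or performed)."""
    return marker_sum(l2) / m_d


def form_meaning_divergence(rho: float, eps_g: float, eps_p: float) -> float:
    """Eq. (11): delta(d) = rho * eps_p / (eps_g + 1)."""
    return rho * eps_p / (eps_g + 1.0)


# Toy document with m_d = 4 sentences and p = 2 chunks (made-up annotations).
m_d = 4
l1a = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 0, 0]]  # 3 devices
l1b = [[0, 1, 0], [0, 0, 0]]                                                # 1 device
l1c = [[1, 0]]                                                              # 1 device
l2a = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]              # 1 genuine
l2b = [[1, 0], [0, 0], [0, 1], [0, 0]]                                      # 2 performed

rho = rhetorical_density(l1a, l1b, l1c, m_d)    # 5 / 4 = 1.25
eps_g = marker_density(l2a, m_d)                # 1 / 4 = 0.25
eps_p = marker_density(l2b, m_d)                # 2 / 4 = 0.5
delta = form_meaning_divergence(rho, eps_g, eps_p)  # 1.25 * 0.5 / 1.25 = 0.5
```

Setting `eps_p` to zero drives `delta` to zero whatever the value of `rho`, which is the co-occurrence property of the multiplicative form noted above.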
#### 4.3.2 Genuine-to-Performed Epistemic Ratio (GPR)
A text may exhibit low divergence simply because its rhetorical intensity is low, while still relying predominantly on performed rather than genuine epistemic marking\. Therefore, epistemic calibration can be isolated independently of rhetorical intensity as
$$\gamma(d) = \frac{\varepsilon_{g}(d)}{\varepsilon_{p}(d) + 1} \tag{12}$$

where $\gamma(d) > 1$ indicates that genuine markers outweigh performed markers and $\gamma(d) < 1$ indicates the reverse. Whereas $\delta(d)$ captures the relationship between rhetorical form and epistemic content, $\gamma(d)$ captures the internal composition of the epistemic layer alone.
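The motivating case for GPR, a text with little rhetoric but mostly performed hedging, can be illustrated with a short sketch. The function names and the input densities are hypothetical toy values chosen only to show how Eq. (11) and Eq. (12) respond differently.

```python
def form_meaning_divergence(rho: float, eps_g: float, eps_p: float) -> float:
    """Eq. (11): delta(d) = rho * eps_p / (eps_g + 1)."""
    return rho * eps_p / (eps_g + 1.0)


def genuine_to_performed_ratio(eps_g: float, eps_p: float) -> float:
    """Eq. (12): gamma(d) = eps_g / (eps_p + 1)."""
    return eps_g / (eps_p + 1.0)


# Toy densities: no rhetorical devices, few genuine and many performed markers.
rho, eps_g, eps_p = 0.0, 0.05, 0.40

delta = form_meaning_divergence(rho, eps_g, eps_p)  # zero, because rho is zero
gamma = genuine_to_performed_ratio(eps_g, eps_p)    # 0.05 / 1.40, about 0.036
```

Here $\delta(d)$ is zero because no rhetorical devices are present, while $\gamma(d)$ remains low and so still registers the dominance of performed markers.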
#### 4.3.3 Rhetorical Device Distribution Entropy (RDDE)
It is reasonable to assume that human expert writers cluster rhetorical devices around moments of genuine argumentative stress. Hence, a uniform distribution across the text suggests deployment driven by local stylistic habit rather than argumentative logic. The document is partitioned into windows of up to 50 words; letting $w_{k}$ denote the total Level 1 device count in window $k$, the Shannon entropy over the device distribution is:

$$\eta(d) = -\sum_{k=1}^{K} p_{k} \log_{2} p_{k}, \qquad p_{k} = \frac{w_{k}}{\sum_{k'=1}^{K} w_{k'}} \tag{13}$$

To ensure comparability across documents of different lengths, $\eta(d)$ is normalised by the maximum possible entropy for $K$ windows:

$$\eta_{\text{norm}}(d) = \frac{\eta(d)}{\log_{2} K} \tag{14}$$

where $\eta_{\text{norm}}(d) = 1$ indicates a perfectly uniform device distribution and $\eta_{\text{norm}}(d) \to 0$ indicates maximal clustering around argumentative pressure points.
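A minimal sketch of Eqs. (13) and (14) follows, taking the per-window device counts $w_{k}$ as given. How devices are assigned to windows is not specified in this section, so that step is left out. Returning 0.0 for documents with fewer than two windows or no devices is an assumption of this sketch, not a rule stated in the source.

```python
import math


def rdde_normalised(window_counts: list[int]) -> float:
    """Normalised Shannon entropy of Level 1 device counts over K windows."""
    k = len(window_counts)
    total = sum(window_counts)
    if k < 2 or total == 0:
        # Assumption: entropy is undefined here, so report 0.0 by convention.
        return 0.0
    # Eq. (13), written as sum of p * log2(1 / p); empty windows contribute 0.
    entropy = sum(
        (w / total) * math.log2(total / w) for w in window_counts if w > 0
    )
    # Eq. (14): divide by the maximum entropy attainable with K windows.
    return entropy / math.log2(k)


uniform = rdde_normalised([2, 2, 2, 2])    # devices spread evenly
clustered = rdde_normalised([8, 0, 0, 0])  # all devices in one window
mixed = rdde_normalised([3, 1, 0, 0])      # partial clustering
```

The evenly spread toy document reaches the upper bound of 1, the fully clustered one gives 0, and the partially clustered one falls in between.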
#### 4.3.4 Discourse-Level Meta-Structural Measures
Level 3 annotation yields four binary document-level indicators drawn from the ERM taxonomy: synthetic closure $\sigma(d)$, aporetic endpoint $\alpha(d)$, premature closure $\pi(d)$, and speculative depth $\varsigma(d)$, each $\in \{0, 1\}$. These are aggregated into corpus-level proportions and compared across sub-corpora using a chi-squared test. For each marker $x \in \{\sigma, \alpha, \pi, \varsigma\}$, the sub-corpus proportion is computed as

$$P_{x}(\mathcal{C}_{k}) = \frac{\bigl|\{d \in \mathcal{C}_{k} : x(d) = 1\}\bigr|}{|\mathcal{C}_{k}|} \tag{15}$$
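Equation (15) is a simple share of documents per marker. The sketch below assumes each document's Level 3 indicators are stored in a dictionary keyed by marker name; the keys, the function name, and the four toy documents are illustrative only. The chi-squared comparison is not shown.

```python
MARKERS = ("sigma", "alpha", "pi", "varsigma")  # hypothetical key names


def subcorpus_proportions(docs: list[dict[str, int]]) -> dict[str, float]:
    """Eq. (15): share of documents in a sub-corpus with x(d) = 1, per marker."""
    n = len(docs)
    if n == 0:
        raise ValueError("sub-corpus must contain at least one document")
    return {x: sum(d[x] for d in docs) / n for x in MARKERS}


# Toy sub-corpus of four documents (made-up indicator values).
toy_docs = [
    {"sigma": 1, "alpha": 0, "pi": 1, "varsigma": 0},
    {"sigma": 1, "alpha": 0, "pi": 0, "varsigma": 0},
    {"sigma": 0, "alpha": 1, "pi": 0, "varsigma": 1},
    {"sigma": 1, "alpha": 0, "pi": 0, "varsigma": 0},
]
proportions = subcorpus_proportions(toy_docs)
```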
## 5 Results and Discussion
Table [3](https://arxiv.org/html/2604.19768#S5.T3) presents a broad statistical overview of the ERM metrics and key device markers across the three sub-corpora. Several patterns are immediately apparent. LLM-generated texts show the highest FMD ($\bar{\delta} = 0.017$) and the most uniform RDDE ($\bar{\eta}_{\text{norm}} = 0.753$), while human expert texts lead on the GPR ($\bar{\gamma} = 0.267$). At the device level, tricolon is the strongest single differentiator, with LLM-generated texts producing nearly twice the expert rate. Erotema shows the reverse pattern, being substantially higher in both human groups than in LLM output. Performed epistemic marker density is approximately double in LLM texts relative to either human group, while the two human sub-corpora are statistically indistinguishable on this measure. Within the LLM sub-corpus, no significant differences are observed across the four models tested, suggesting that the miscalibration pattern is a structural property of LLM generation rather than a model-specific artifact.
Table 3: Key ERM metrics and device markers across sub-corpora. Bold values mark sub-corpora that differ significantly from at least one other group (significance threshold $p<0.05$, corrected for multiple comparisons). $\Delta$ is Cohen's effect size for the Human Expert vs LLM-Generated comparison.

| Metric | HE $\mu\,(\sigma)$ | HN $\mu\,(\sigma)$ | LG $\mu\,(\sigma)$ | $\Delta$ |
| --- | --- | --- | --- | --- |
| Composite metrics | | | | |
| FMD $\delta(d)$ | 0.009 (0.010) | 0.012 (0.019) | 0.017 (0.016) | 0.68 |
| GPR $\gamma(d)$ | 0.267 (0.133) | 0.172 (0.075) | 0.217 (0.105) | 0.42 |
| RDDE $\eta_{\text{norm}}(d)$ | 0.666 (0.143) | 0.697 (0.123) | 0.753 (0.083) | 0.74 |
| Level 1 device markers (mean count per document) | | | | |
| Tricolon | 3.73 (3.48) | 4.87 (3.26) | 7.13 (3.66) | 0.95 |
| Erotema | 5.55 (5.99) | 5.11 (5.03) | 2.28 (2.17) | 0.73 |
| Correctio | 0.40 (0.64) | 0.45 (0.83) | 0.17 (0.48) | 0.40 |
| Level 2 epistemic markers | | | | |
| Performed $\varepsilon_{p}$ | 0.057 (0.051) | 0.058 (0.069) | 0.114 (0.102) | 0.72 |
| Complexity tokens | 4.63 (4.96) | 4.77 (7.03) | 7.33 (7.33) | 0.43 |

### 5.1 Rhetorical Device Repertoire
Figure 3: Mean count per document of all Level 1 and Level 2b markers. Significance markers: ${}^{*}p<.05$, ${}^{**}p<.01$, ${}^{***}p<.001$. Unmarked devices show no significant difference. Section labels on the right indicate the ERM taxonomy level.

Figure [3](https://arxiv.org/html/2604.19768#S5.F3) reveals a systematic divergence in rhetorical devices between LG and human-authored texts. *Tricolon* is the strongest differentiating marker ($p<0.001$), with LLM-generated texts producing a mean of 7.13 per document versus 3.73 for human experts ($\Delta=0.95$). This is consistent with tricolon functioning as a default structural filler deployed independently of argumentative occasion.
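The tricolon effect size can be approximately recovered from the summary statistics in Table 3. The sketch below assumes that $\Delta$ is Cohen's $d$ computed with a pooled standard deviation and equal group sizes. This section does not spell out the exact formula, so the function `cohens_d_from_summary` is a hypothetical helper rather than the authors' implementation.

```python
import math


def cohens_d_from_summary(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d from group means and standard deviations.

    Assumes equal group sizes, so the pooled variance is the simple
    average of the two group variances (an assumption of this sketch).
    """
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_b - mean_a) / pooled_sd


# Tricolon, Table 3: HE = 3.73 (3.48), LG = 7.13 (3.66).
d_tricolon = cohens_d_from_summary(3.73, 3.48, 7.13, 3.66)
print(round(d_tricolon, 2))  # prints 0.95
```

Because the table reports rounded means and standard deviations, values recomputed this way may differ from the published $\Delta$ in the second decimal place for other rows.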
*Erotema* shows the opposite pattern. LLM texts show a frequency of only 2.28 per document against 5.55 (HE) and 5.11 (HN). The two human groups are statistically indistinguishable from each other ($p=1.00$), and both are significantly higher than LG ($\Delta\approx 0.73$). Since erotema requires the assertion of a proposition through the form of a question, its suppression in LLM output is consistent with occasion-dependent devices being less accessible to statistical generation, though prompt-induced register effects cannot be fully excluded.
Both *performed epistemic markers* are significantly elevated in LLM output relative to both human groups. The observation confirms performed hesitancy as a specific LLM signature. *Correctio* is modestly but significantly lower in LLM texts ($\Delta\approx 0.41$), consistent with reduced genuine argumentative revision. Other markers show no significant differences across sub-corpora.
### 5.2 Epistemic Positioning
Mapping each document onto the $\varepsilon_{g}$–$\varepsilon_{p}$ plane reveals three distinct epistemic positions (Figure [4](https://arxiv.org/html/2604.19768#S5.F4)). The HE ellipse extends farthest along the $\varepsilon_{g}$ axis, with the centroid located well into the genuine-dominant region. The compressed HN ellipse, positioned near the origin, reflects low density of both marker types. The LG ellipse is notably elongated along the $\varepsilon_{p}$ axis, with the centroid displaced upward relative to both human groups, indicating systematically elevated performed marker density relative to genuine epistemic grounding.

Figure 4: Distribution of documents in the genuine ($\varepsilon_{g}$) versus performed ($\varepsilon_{p}$) epistemic marker density plane. Large markers denote group centroids. Dashed ellipses are 95% confidence regions. The diagonal reference line marks $\varepsilon_{g}=\varepsilon_{p}$ (GPR $=1$); documents above the line show performed-marker dominance and documents below show genuine-marker dominance.

The performed epistemic marker density of LG texts ($\bar{\varepsilon}_{p}=0.114$) is significantly higher than that of HE ($\bar{\varepsilon}_{p}=0.057$; $\Delta=0.72$, $p<0.001$) and HN ($\bar{\varepsilon}_{p}=0.058$; $\Delta=0.65$, $p<0.001$) texts. The near-identical $\varepsilon_{p}$ values of the two human groups confirm that performed hesitancy is specifically an LLM signature rather than a marker of non-expertise.
The density of genuine epistemic markers $\varepsilon_{g}$ is highest in HE texts ($\bar{\varepsilon}_{g}=0.283$), followed by LG ($\bar{\varepsilon}_{g}=0.242$) and HN texts ($\bar{\varepsilon}_{g}=0.182$). This ordering, with LLM output lying between the two human groups, seems consistent with the corpus design: the uniform prompt instruction to *engage honestly with uncertainty* likely induced more genuine epistemic marking in LLM output than unconstrained generation might have produced.
### 5.3 Composite Metrics
#### 5.3.1 Form-Meaning Divergence
FMD follows a monotonic ordering across sub-corpora (Figure [5](https://arxiv.org/html/2604.19768#S5.F5)(a)). The HE texts score lowest ($\bar{\delta}=0.009$), HN intermediate ($\bar{\delta}=0.012$), and LG highest ($\bar{\delta}=0.017$). Both human groups differ significantly from LG ($p<0.001$; $\Delta=0.68$ for HE vs LG, $\Delta=0.30$ for HN vs LG), while the difference between the two human groups does not reach significance because the gap is small relative to within-group variance. The dominant effect is the elevation of LLM output beyond both human groups. LLM texts show elevated rhetorical intensity $\rho(d)$, while the genuine epistemic grounding in the denominator of ([11](https://arxiv.org/html/2604.19768#S4.E11)) does not rise proportionally to compensate. In Gricean terms, their assertoric weight consistently exceeds the epistemic position they report.

Figure 5: Distribution of the three composite ERM metrics across sub-corpora. Each violin shows the full density estimate; the internal bar spans the interquartile range; the white dot marks the median; jittered points show individual documents. Significance brackets: ${}^{*}p<.05$, ${}^{***}p<.001$. (a) FMD $\delta(d)$. (b) GPR $\gamma(d)$. (c) RDDE $\eta_{\mathrm{norm}}(d)$.
#### 5.3.2 Genuine-to-Performed Epistemic Ratio
GPR shows a three-way ordering (Figure [5](https://arxiv.org/html/2604.19768#S5.F5)(b)), with HE highest ($\bar{\gamma}=0.267$), LG intermediate ($\bar{\gamma}=0.217$), and HN lowest ($\bar{\gamma}=0.172$) ($p<0.001$). The largest pairwise contrast is between HE and HN texts ($\Delta=0.89$, $p<0.001$), reflecting the expert advantage in grounding claims in identifiable evidential bases, visible in substantially higher counts of modal auxiliaries, adverbials, and syntactic restrictors. LLM texts sit above non-experts but below experts. The LG vs HN comparison does not reach significance ($p=0.055$), which is consistent with the prompt instruction to engage honestly with uncertainty.
#### 5.3.3 Rhetorical Device Distribution Entropy
LLM-generated texts show the highest and most consistent RDDE (Figure [5](https://arxiv.org/html/2604.19768#S5.F5)(c); $\bar{\eta}_{\text{norm}}=0.753$, $\sigma_{\eta_{\text{norm}}}=0.083$) compared to HE ($\bar{\eta}_{\text{norm}}=0.666$, $\sigma_{\eta_{\text{norm}}}=0.143$; $\Delta=0.74$, $p<0.001$) and HN ($\bar{\eta}_{\text{norm}}=0.697$; $\Delta=0.53$, $p=0.011$) texts. The human groups do not differ significantly from each other ($p=0.573$). Higher entropy indicates a more uniform device distribution across the document, consistent with devices being deployed out of stylistic habit rather than argumentative occasion. The lower LG variance, in turn, points to template-driven generation producing a cross-document consistency that is absent in human writing.
### 5.4 Discourse-Level Structural Markers
Of the corpus-level proportions for the four Level 3 markers, only that of the aporetic endpoint differs significantly across sub-corpora (Figure [6](https://arxiv.org/html/2604.19768#S5.F6)). Synthetic closure dominates across all three sub-corpora (HE: 66.7%, HN: 69.3%, LG: 76.0%, $p=0.433$), consistent with the argumentative essay genre tending toward resolution regardless of author type. Premature closure and speculative depth show no significant differences.

Figure 6: Corpus-level proportions of Level 3 discourse markers per sub-corpus. Cell values show the percentage and raw count of documents annotated with each marker. Color intensity encodes proportion. Column headers show significance of chi-squared tests.

Aporetic endpoint is elevated in LG texts (24.0%) relative to HN (8.0%) and HE (2.7%) texts ($p<0.001$). This counters the intuition that LLMs default to synthetic closure. A plausible interpretation is that the aporetic form may be a learnable surface pattern, i.e., open-ended phrases (such as *remains an open question*) deployed reflexively without genuine evidential underdetermination, constituting a discourse-level instance of performed hesitancy.
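The chi-squared comparisons for the Level 3 markers can be sketched in a few lines of Python. The sketch assumes a plain Pearson chi-squared test on a 2×3 contingency table (marker present or absent, by sub-corpus) and 75 documents per sub-corpus. The raw counts are inferred from the reported percentages rather than read from Figure 6, and the function names are hypothetical. With two degrees of freedom, the chi-squared survival function has the closed form $e^{-\chi^{2}/2}$, so no statistics library is needed.

```python
import math


def chi_squared_binary_marker(positives, sizes):
    """Pearson chi-squared statistic for one binary marker across groups.

    positives[i] is the number of documents in group i with the marker,
    sizes[i] is the total number of documents in group i. The marker must
    be present in some documents and absent in others, otherwise an
    expected count is zero and the statistic is undefined.
    """
    total_pos = sum(positives)
    total = sum(sizes)
    stat = 0.0
    for pos, n in zip(positives, sizes):
        exp_pos = n * total_pos / total   # expected count with the marker
        exp_neg = n - exp_pos             # expected count without it
        neg = n - pos
        stat += (pos - exp_pos) ** 2 / exp_pos + (neg - exp_neg) ** 2 / exp_neg
    return stat


def p_value_two_df(stat):
    """Survival function of the chi-squared distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


sizes = [75, 75, 75]                 # HE, HN, LG (assumed equal sub-corpora)
synthetic_closure = [50, 52, 57]     # inferred from 66.7%, 69.3%, 76.0%
aporetic_endpoint = [2, 6, 18]       # inferred from 2.7%, 8.0%, 24.0%

p_synthetic = p_value_two_df(chi_squared_binary_marker(synthetic_closure, sizes))
p_aporetic = p_value_two_df(chi_squared_binary_marker(aporetic_endpoint, sizes))
```

Under these assumptions the synthetic-closure comparison gives a p-value of about 0.433 and the aporetic-endpoint comparison falls below 0.001, in line with the values reported above.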
### 5.5 Model-Level Variation Within the LLM Sub-corpus
Figure [7](https://arxiv.org/html/2604.19768#S5.F7) presents ERM metric distributions by model within the LLM sub-corpus ($n_{\text{gpt}}=31$; $n_{\text{deepseek}}=19$; $n_{\text{claude}}=13$; $n_{\text{gemini}}=12$). No significant differences are observed across models on any composite metric.

Figure 7: Distribution of composite ERM metrics within the LLM-generated sub-corpus by model. Box spans the interquartile range; white line marks the median; jittered points show individual documents. Kruskal-Wallis $p$-values are displayed in the top-right corner of each panel. (a) FMD $\delta(d)$. (b) GPR $\gamma(d)$. (c) RDDE $\eta_{\mathrm{norm}}(d)$.

This null result is itself a substantive finding. The epistemic-rhetorical miscalibration pattern is not an artifact of any single model's generation strategy but appears to be a systematic property of the LLM generation paradigm as a whole. The ERM framework, therefore, is potentially robust to model variations.
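For readers who want to run a comparable per-model check on their own annotations, the following is a minimal standard-library sketch of the Kruskal-Wallis H statistic with the usual tie correction. The function name `kruskal_wallis_h` and the per-model scores are hypothetical and are not data from the paper. The statistic is compared against 7.815, the 5% critical value of the chi-squared distribution with three degrees of freedom (four groups).

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    `groups` is a list of non-empty lists of numeric scores, one per group.
    Returns 0.0 when every observation is identical (the statistic is
    undefined in that case; 0.0 is a convention of this sketch).
    """
    # Pool all observations, remembering which group each came from.
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their average rank.
        avg_rank = (i + 1 + j) / 2.0
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        r * r / len(values) for r, values in zip(rank_sums, groups)
    ) - 3.0 * (n_total + 1)
    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    return h / correction if correction > 0 else 0.0


# Hypothetical FMD-like scores for four models (illustration only).
scores_by_model = [
    [0.012, 0.019, 0.015, 0.021],
    [0.014, 0.018, 0.016],
    [0.013, 0.020, 0.017],
    [0.011, 0.022, 0.016],
]
CRITICAL_DF3_ALPHA_05 = 7.815  # chi-squared critical value, df = 3, alpha = 0.05
h_stat = kruskal_wallis_h(scores_by_model)
models_differ = h_stat > CRITICAL_DF3_ALPHA_05
```

A rank-based test is a natural choice here because the per-model groups are small and unequal in size, so normality assumptions would be hard to justify.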
### 5.6 Discussion, Limitations and Future Directions
The findings converge on a coherent characterisation of LLM epistemic-rhetorical miscalibration operating across multiple linguistic levels. At the device level, LLM texts favour tricolon, deployable without argumentative occasion, while suppressing erotema, which requires one. At the epistemic level, the miscalibration is specifically a *performed hesitancy excess*. LLMs produce genuine markers at a reasonable rate but performed markers at twice the human rate. Additionally, the two human groups remain indistinguishable on performed hesitancy, confirming this as an LLM-specific signature. At the discourse level, elevated FMD and uniform RDDE jointly indicate that rhetorical elaboration runs independently of argumentative structure.
Taken together, these patterns are consistent with the theoretical intuitions supplied by Gricean pragmatics, Relevance Theory, and Brandomian inferentialism\. LLM texts deploy assertoric weight that exceeds their epistemic position, impose processing costs without proportionate epistemic returns, and undertake discursive commitments that their visible inferential entitlements do not adequately license\.
However, there are several noteworthy limitations. The corpus covers English argumentative prose only. Generalisation to other languages, genres, and stylistic domains remains untested. LLM annotation was used for scalability, and no formal, comprehensive inter-annotator reliability was established against human coders, as this was considered beyond the scope of a philosophically inspired, exploratory framework. The prompt instruction to engage honestly with uncertainty likely suppressed performed hesitancy in the LG sub-corpus; the reported FMD values may therefore still underestimate miscalibration in the unconstrained LLM output commonly found in online content. Examining specific markers in an extended LG sub-corpus with varied model representation may give further insight into the epistemic alignment of different model variants.
Finally, the ERM taxonomy is designed to measure empirically falsifiable, optimizable and programmatically discernible surface\-linguistic markers\. The underlying, more complex epistemic states remain a subject of close\-reading augmented, qualitative linguistic enquiry\.
## 6 Conclusion
This study introduced a framework for quantifying epistemic-rhetorical miscalibration in large language models through annotation of rhetorical devices, epistemic stance markers, and discourse-level argumentative structure. Applied to a medium-sized corpus across three author types, the framework identifies a consistent and model-agnostic LLM signature, including elevated tricolon density, suppressed erotema, and performed hesitancy at twice the human rate, replicated across four frontier models and grounded in Gricean pragmatics, Relevance Theory, and Brandomian inferentialism.
The ERM pipeline is fully automatable and executable by any capable language model against any argumentative text\. This makes it deployable as a lightweight miscalibration screening tool for better epistemic alignment of target texts\. Significant individual markers such as tricolon density and performed hesitancy density could also serve as features in LLM\-generated text detection pipelines, complementing existing stylometric approaches with theoretically motivated epistemic signals\. FMD and GPR further operationalize properties not captured by standard benchmarks, measuring proportionality of assertoric weight to epistemic position and calibration of rhetorical form to evidential grounding\.
## Appendix A ERM Taxonomy
Table 4: Level 1a – Sentence-level Rhetorical Devices. Devices that operate within a single sentence to modulate presentational intensity through syntactic and phonological structure.

Table 5: Level 1b – Argument-level Rhetorical Devices. Devices that organise the internal structure of an argument across multiple sentences or sub-claims.

Table 6: Level 1c – Narrative-level Rhetorical Devices. Devices that operate across the full span of a discourse unit, shaping its overall trajectory and producing large-scale rhetorical effects such as surprise or discovery.

Table 7: Level 2a – Genuine Epistemic Stance Markers. Lexical and syntactic devices that ground a speaker's commitment to a claim in an identifiable evidential base or calibrated modal force.

Table 8: Level 2b – Performed Epistemic Stance Markers. Surface devices that adopt the register of uncertainty or reflexivity without providing the evidential grounding, modal calibration, or restrictor structure that genuine epistemic marking presupposes.

Table 9: Level 3 – Discourse-level Argumentative Structure. Global properties of the argument's endpoint and overall inferential trajectory, identified at the level of the whole text rather than the sentence or clause.
## References
- Text Authorship Attribution: Stylometric Insights into Human and LLM\-Generated Text\.InProceedings of the 8th International Conference on Data Science and Management of Data \(12th ACM IKDD CODS and 30th COMAD\),CODS\-COMAD ’24,New York, NY, USA,pp\. 344–346\.External Links:ISBN 979\-8\-4007\-1124\-4,[Document](https://dx.doi.org/10.1145/3703323.3703712)Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- S\. K\. Aityan, W\. Claster, K\. S\. Emani, S\. Rais, and T\. Tran \(2026\)A Lightweight Approach to Detection of AI\-Generated Texts Using Stylometric Features\.arXiv\.Note:arXiv:2511\.21744 \[cs\]External Links:[Document](https://dx.doi.org/10.48550/arXiv.2511.21744)Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- M\. S\. Al\-Shaibani and M\. Ahmed \(2026\)Arabic machine\-generated text detection: Stylometric analysis and cross\-model evaluation\.Expert Systems with Applications305,pp\. 130644\.External Links:ISSN 0957\-4174,[Document](https://dx.doi.org/10.1016/j.eswa.2025.130644)Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- E\. M\. Bender, T\. Gebru, A\. McMillan\-Major, and S\. Shmitchell \(2021\)On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?\.InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency,FAccT ’21,New York, NY, USA,pp\. 610–623\.External Links:ISBN 978\-1\-4503\-8309\-7,[Document](https://dx.doi.org/10.1145/3442188.3445922)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p1.1),[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- T\. Bisztray, B\. Cherif, R\. A\. Dubniczky, N\. Gruschka, B\. Borsos, M\. A\. Ferrag, A\. Kovacs, V\. Mavroeidis, and N\. Tihanyi \(2026\)I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution\.InProceedings of the 18th ACM Workshop on Artificial Intelligence and Security,AISec ’25,New York, NY, USA,pp\. 28–39\.External Links:ISBN 979\-8\-4007\-1895\-3,[Document](https://dx.doi.org/10.1145/3733799.3762964)Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- S\. L\. Blodgett, S\. Barocas, H\. Daumé III, and H\. Wallach \(2020\)Language \(Technology\) is Power: A Critical Survey of “Bias” in NLP\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 5454–5476\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.485)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p1.1),[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- R\. Brandom \(1994\)Making it Explicit: Reasoning, Representing, and Discursive Commitment\.Harvard university press\.Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p4.1),[§3\.1\.3](https://arxiv.org/html/2604.19768#S3.SS1.SSS3.p1.1)\.
- R\. Brandom \(1997\)Précis of Making It Explicit\.Philosophy and Phenomenological Research57\(1\),pp\. 153–156\.Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p4.1),[§3\.1\.3](https://arxiv.org/html/2604.19768#S3.SS1.SSS3.p1.1),[§3\.2\.2](https://arxiv.org/html/2604.19768#S3.SS2.SSS2.p1.1)\.
- J\. Burrows \(2002\)‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship\.Literary and linguistic computing17\(3\),pp\. 267–287\.Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- N\. Clark, H\. Shen, B\. Howe, and T\. Mitra \(2025\)Epistemic Alignment: A Mediating Framework for User\-LLM Knowledge Delivery\.arXiv\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2504.01205)Cited by:[§2\.3](https://arxiv.org/html/2604.19768#S2.SS3.p1.1)\.
- D\. Clausen \(2010\)HedgeHunter: A System for Hedge Detection and Uncertainty Classification\.InProceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task,R\. Farkas, V\. Vincze, G\. Szarvas, G\. Móra, and J\. Csirik \(Eds\.\),Uppsala, Sweden,pp\. 120–125\.Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p3.1),[§2\.3](https://arxiv.org/html/2604.19768#S2.SS3.p1.1)\.
- F\. Erhardt \(2025\)Metacognitive Text Organizastion Semiotic and Rhetorical Agency in LLMs\.Arts and Humanities\(en\)\.External Links:[Document](https://dx.doi.org/10.20944/preprints202512.1177.v1)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p1.1),[§1](https://arxiv.org/html/2604.19768#S1.p3.1),[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- K\. V\. Fintel and A\. S\. Gillies \(2007\)An Opinionated Guide to Epistemic Modality\.InOxford Studies In Epistemology,T\. S\. Gendler and J\. Hawthorne \(Eds\.\),pp\. 32–62\(en\)\.External Links:[Document](https://dx.doi.org/10.1093/oso/9780199237067.003.0002)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p2.1),[§3\.2\.2](https://arxiv.org/html/2604.19768#S3.SS2.SSS2.p1.1)\.
- I\. O\. Gallegos, R\. A\. Rossi, J\. Barrow, M\. M\. Tanjim, S\. Kim, F\. Dernoncourt, T\. Yu, R\. Zhang, and N\. K\. Ahmed \(2024\)Bias and Fairness in Large Language Models: A Survey\.Computational Linguistics\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2309.00770)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p1.1),[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- H\. P\. Grice \(1975\)Logic and Conversation\.InSpeech acts,pp\. 41–58\.Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p4.1),[§3\.1\.1](https://arxiv.org/html/2604.19768#S3.SS1.SSS1.p1.1)\.
- Y\. Guo, M\. Guo, J\. Su, Z\. Yang, M\. Zhu, H\. Li, M\. Qiu, and S\. S\. Liu \(2024\)Bias in Large Language Models: Origin, Evaluation, and Mitigation\.arXiv\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2411.10915)Cited by:[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- M\. Honnibal, I\. Montani, S\. Van Landeghem, and A\. Boyd \(2020\)SpaCy: industrial\-strength natural language processing in python\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.1212303)Cited by:[§4\.1\.1](https://arxiv.org/html/2604.19768#S4.SS1.SSS1.p2.1)\.
- F\. Jannidis, S\. Pielström, C\. Schöch, and T\. Vitt \(2015\)Improving Burrows’ Delta\. An Empirical Evaluation of Text Distance Measures\.InDigital Humanities Conference,Vol\.11,pp\. 10\.Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- Z\. Ji, T\. Yu, Y\. Xu, N\. Lee, E\. Ishii, and P\. Fung \(2023\)Towards Mitigating LLM Hallucination via Self Reflection\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 1827–1843\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.123)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p1.1),[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- K\. Kenthapadi, M\. Sameki, and A\. Taly \(2024\)Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned \(Survey\)\.InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining,pp\. 6523–6533\.Cited by:[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- O\. Kramer \(2025\)RHET AI\. Critical Rhetoric in the Age of Artificial Intelligence\.Argumentation et Analyse du Discours35\(en\)\.External Links:ISSN 1565\-8961,[Document](https://dx.doi.org/10.4000/14yb8)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p3.1)\.
- T\. Kumarage and H\. Liu \(2023\)Neural Authorship Attribution: Stylometric Analysis on Large Language Models\.arXiv\.Note:arXiv:2308\.07305 \[cs\]External Links:[Document](https://dx.doi.org/10.48550/arXiv.2308.07305)Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- D\. Lee, Y\. Hwang, Y\. Kim, J\. Park, and K\. Jung \(2025\)Are LLM\-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM\-based Evaluation\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 8962–8984\.External Links:ISBN 979\-8\-89176\-189\-6,[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.452)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p2.1),[§2\.3](https://arxiv.org/html/2604.19768#S2.SS3.p1.1)\.
- M\. Li, M\. Vrazitulis, and D\. Schlangen \(2025\)Representations of Fact, Fiction and Forecast in Large Language Models: Epistemics and Attitudes\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 27734–27757\.External Links:ISBN 979\-8\-89176\-251\-0,[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1345)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p1.1),[§1](https://arxiv.org/html/2604.19768#S1.p2.1)\.
- Z\. Li and Q\. Zhang \(2025\)Linguistic Differences between AI and Human Comments in Weibo: Detect AI\-Generated Text through Stylometric Features\.InProceedings of the 24th China National Conference on Computational Linguistics,pp\. 842–851\(en\)\.Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p2.1),[§2\.3](https://arxiv.org/html/2604.19768#S2.SS3.p1.1)\.
- Z\. P\. Majdik and S\. S\. Graham \(2024\)Rhetoric of/with AI: An Introduction\.Rhetoric Society Quarterly54\(3\),pp\. 222–231\(en\)\.External Links:ISSN 0277\-3945, 1930\-322X,[Document](https://dx.doi.org/10.1080/02773945.2024.2343264)Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- H\. A\. McKee and J\. E\. Porter \(2020\)Ethics for AI Writing: The Importance of Rhetorical Context\.InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society,AIES ’20,New York, NY, USA,pp\. 110–116\.External Links:ISBN 978\-1\-4503\-7110\-0,[Document](https://dx.doi.org/10.1145/3375627.3375811)Cited by:[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- B\. Medlock and T\. Briscoe \(2007\)Weakly Supervised Learning for Hedge Classification in Scientific Literature\.InProceedings of the 45th Annual Meeting of the Association of Computational Linguistics,A\. Zaenen and A\. van den Bosch \(Eds\.\),Prague, Czech Republic,pp\. 992–999\.Cited by:[§2\.3](https://arxiv.org/html/2604.19768#S2.SS3.p1.1)\.
- T\. Neal, K\. Sundararajan, A\. Fatima, Y\. Yan, Y\. Xiang, and D\. Woodard \(2017\)Surveying Stylometry Techniques and Applications\.ACM Computing Surveys \(CSuR\)50\(6\),pp\. 1–36\.Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.
- C\. Raj, A\. Mukherjee, A\. Caliskan, A\. Anastasopoulos, and Z\. Zhu \(2024\)Breaking Bias, Building Bridges: Evaluation and Mitigation of Social Biases in LLMs via Contact Hypothesis\.Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society7\(1\),pp\. 1180–1189\.Cited by:[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- R\. Ranjan, S\. Gupta, and S\. N\. Singh \(2024\)A Comprehensive Survey of Bias in LLMs: Current Landscape and Future Directions\.arXiv\.External Links:[Link](http://arxiv.org/abs/2409.16430),[Document](https://dx.doi.org/10.48550/arXiv.2409.16430)Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p1.1),[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- S\. Raza, O\. Bamgbose, S\. Ghuge, F\. Tavakoli, D\. J\. Reji, and S\. R\. Bashir \(2025\)Developing Safe and Responsible Large Language Model: Can We Balance Bias Reduction and Language Understanding?\.Machine Learning114\(6\),pp\. 140\(en\)\.External Links:ISSN 1573\-0565,[Document](https://dx.doi.org/10.1007/s10994-025-06767-4)Cited by:[§2\.1](https://arxiv.org/html/2604.19768#S2.SS1.p1.1)\.
- C\. Stab and I\. Gurevych \(2017\)Parsing argumentation structures in persuasive essays\.Computational Linguistics43\(3\),pp\. 619–659\.Cited by:[§4\.1\.1](https://arxiv.org/html/2604.19768#S4.SS1.SSS1.p8.1)\.
- G\. Szarvas \(2008\)Hedge Classification in Biomedical Texts with a Weakly Supervised Selection of Keywords\.InProceedings of ACL\-08: HLT,J\. D\. Moore, S\. Teufel, J\. Allan, and S\. Furui \(Eds\.\),Columbus, Ohio,pp\. 281–289\.Cited by:[§2\.3](https://arxiv.org/html/2604.19768#S2.SS3.p1.1)\.
- L\. Tao, Y\. Yeh, B\. Kai, M\. Dong, T\. Huang, T\. A\. Lamb, J\. Yu, P\. H\. S\. Torr, and C\. Xu \(2025\)Can Large Language Models Express Uncertainty Like Human?\.arXiv\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2509.24202)Cited by:[§2\.3](https://arxiv.org/html/2604.19768#S2.SS3.p1.1)\.
- S\. E\. Toulmin \(2003\)The uses of argument\.Cambridge university press\.Cited by:[§4\.1\.1](https://arxiv.org/html/2604.19768#S4.SS1.SSS1.p6.1)\.
- D\. Wilson and D\. Sperber \(2002\)Relevance Theory\.Handbook of Pragmatics\.Cited by:[§1](https://arxiv.org/html/2604.19768#S1.p4.1),[§3\.1\.2](https://arxiv.org/html/2604.19768#S3.SS1.SSS2.p1.1)\.
- W\. Zaitsu, M\. Jin, S\. Ishihara, S\. Tsuge, and M\. Inaba \(2025\)Stylometry can reveal artificial intelligence authorship, but humans struggle: A comparison of human and seven large language models in Japanese\.PLOS ONE20\(10\),pp\. e0335369\(en\)\.External Links:ISSN 1932\-6203,[Document](https://dx.doi.org/10.1371/journal.pone.0335369)Cited by:[§2\.2](https://arxiv.org/html/2604.19768#S2.SS2.p1.1)\.