PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

arXiv cs.CL Papers

Summary

PragReST is a self-supervised framework that improves LLM pragmatic reasoning by generating counterfactual reasoning traces and training models via supervised fine-tuning and reinforcement learning, achieving significant gains on pragmatic benchmarks without human-labeled data.

arXiv:2606.18624v1 Announce Type: new Abstract: Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

# Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding
Source: [https://arxiv.org/html/2606.18624](https://arxiv.org/html/2606.18624)
Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel\-Eskin \(The University of Texas at Austin\)

###### Abstract

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning\. Despite strong performance on math and logical reasoning, large language models \(LLMs\) still struggle with making pragmatic inferences, often choosing literal interpretations\. To improve LLM pragmatic reasoning, we introduce PragReST, a self\-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine\-tuning and reinforcement learning, without human\-labeled training data or distillation from a stronger teacher\. Across four pragmatic benchmarks \(PragMega, Ludwig, MetoQA, and AltPrag\), PragReST improves over backbone models, task\-specific pragmatic tuning baselines, and non\-counterfactual variants of the same pipeline\. On accuracy\-based benchmarks, PragReST improves over the instruct backbone by 5\.37 and 5\.50% \(absolute\) for Qwen3\-8B and Qwen3\-14B, respectively\. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance\. Moreover, our training preserves out\-of\-domain performance on general\-knowledge and mathematical reasoning benchmarks\. Code and models are available [here](https://github.com/jihyung803/PragReST)\.

Equal contribution: Jihyung Park and Minchao Huang\.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.18624v1/x1.png)

Figure 1: Example of counterfactual pragmatic reasoning in PragReST: successfully recovering the intended meaning involves reasoning about alternative utterances\. Compared with the instruct model, PragReST shifts preference from the literal interpretation to the intended pragmatic interpretation\.

Robust language understanding requires recovering the intentions, assumptions, and implicit meanings that speakers leave unsaid \(Grice, [1975](https://arxiv.org/html/2606.18624#bib.bib2)\), i\.e\., making pragmatic inference beyond literal meaning and reasoning about shared context, speaker goals, and background information such as implicatures and presuppositions \(Levinson, [1983](https://arxiv.org/html/2606.18624#bib.bib3)\)\. Humans can make these inferences with ease in daily conversation\. Although large language models \(LLMs\) have improved substantially on reasoning\-heavy domains such as mathematics and code \(Shao et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib8); Trung et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib9)\), recent work suggests that they still struggle with pragmatic inference tasks \(Ruis et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib15); Hu et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib14); Ma et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib7); Sravanthi et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib6)\)\. These findings point to a persistent gap between LLMs and humans in pragmatic reasoning, suggesting that surface fluency and broad semantic competence do not, by themselves, constitute robust communicative understanding \(Fried et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib4)\)\.

Fundamentally, pragmatic inference can be framed as a *counterfactual* reasoning process: to interpret an utterance, a listener compares what the speaker actually said with what they would likely have said under alternative intended meanings \(Frank and Goodman, [2012](https://arxiv.org/html/2606.18624#bib.bib21)\)\. Rather than judging an interpretation only by whether it is compatible with the literal words, a listener asks whether the speaker would have chosen a different utterance if that interpretation were the intended one\. [Fig\. 1](https://arxiv.org/html/2606.18624#S1.F1) illustrates this process: when Mary answers Ken’s tea question with *“In a cup\.”*, a literal interpretation would be that she wants her tea in a cup, while a counterfactual reasoning process surfaces Mary’s displeasure\. Indeed, counterfactual and pragmatic reasoning have been linked at the level of brain responses \(Kulakova and Nieuwland, [2016](https://arxiv.org/html/2606.18624#bib.bib58)\), and counterfactual reasoning is central to pragmatic frameworks like Iterated Best Response \(IBR\) \(Franke, [2009](https://arxiv.org/html/2606.18624#bib.bib59)\) and Rational Speech Acts \(RSA\) \(Frank and Goodman, [2012](https://arxiv.org/html/2606.18624#bib.bib21); Goodman and Frank, [2016](https://arxiv.org/html/2606.18624#bib.bib1)\), which both cast pragmatic interpretation as recursive reasoning between speakers and listeners over communicative alternatives\.
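To make the counterfactual comparison concrete, the following is a minimal sketch of an RSA\-style pragmatic listener in the spirit of Frank and Goodman \(2012\)\. It is not part of PragReST: the two\-utterance scalar example \("some" vs\. "all"\), the uniform prior, and the rationality parameter `alpha` are illustrative assumptions\.

```python
# Toy scalar-implicature setup (illustrative assumption, not from the paper).
MEANINGS = ["some_but_not_all", "all"]
UTTERANCES = ["some", "all"]

# Literal semantics: 1.0 if utterance u is literally true under meaning m.
TRUTH = {
    ("some", "some_but_not_all"): 1.0,
    ("some", "all"): 1.0,
    ("all", "some_but_not_all"): 0.0,
    ("all", "all"): 1.0,
}
PRIOR = {m: 0.5 for m in MEANINGS}  # uniform prior over meanings (assumed)


def normalize(scores):
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_listener(u):
    """L0(m | u): prior restricted to meanings where u is literally true."""
    return normalize({m: TRUTH[(u, m)] * PRIOR[m] for m in MEANINGS})


def speaker(m, alpha=1.0):
    """S1(u | m): prefers utterances that lead L0 to the intended meaning."""
    # exp(alpha * log L0(m | u)) is the same as L0(m | u) ** alpha.
    return normalize({u: literal_listener(u)[m] ** alpha for u in UTTERANCES})


def pragmatic_listener(u, alpha=1.0):
    """L1(m | u): asks which meaning would have made the speaker say u."""
    return normalize({m: speaker(m, alpha)[u] * PRIOR[m] for m in MEANINGS})


posterior = pragmatic_listener("some")
print(posterior)
```

In this toy setting the pragmatic listener assigns probability 0\.75 to the "some but not all" reading after hearing "some", because a speaker who meant "all" would more likely have said "all"\. This is the same contrast between the observed utterance and its alternatives that the paragraph above describes\.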

Given the framing of pragmatic inference as a reasoning process, a natural question is whether recent advances in reasoning\-oriented post\-training can also teach counterfactual pragmatic reasoning\. Reinforcement Learning with Verifiable Rewards \(RLVR\) has driven progress in math and code by rewarding reasoning trajectories whose final answers can be checked by deterministic verifiers \(Shao et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib8); Trung et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib9)\)\. Pragmatic reasoning lacks this kind of verification signal: whether an interpretation is correct often depends on subtle contextual assumptions, speaker goals, and social expectations, so the same utterance may support different meanings under small changes in context \(Fried et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib4); Anuranjana et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib10)\)\. A second challenge is the scarcity of scalable pragmatic supervision\. Large\-scale human annotation is costly because annotators must judge not only surface correctness but also whether an interpretation is contextually licensed\. Distillation from stronger teachers is also an imperfect substitute: distilled performance is limited by the teacher’s ability to perform pragmatic reasoning, which may be imperfect even for frontier models \(Ma et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib7); Sravanthi et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib6)\)\.

To address these challenges, we introduce Pragmatic Reasoning via Self\-Training \(PragReST\), a framework for learning counterfactual pragmatic reasoning from self\-generated data\. PragReST is self\-reinforcing in the sense that it does not rely on human\-labeled pragmatic training data, benchmark supervision, or a stronger external teacher model to distill pragmatic knowledge into the policy\. Instead, the same model is used throughout the pipeline\. As illustrated in Figure [2](https://arxiv.org/html/2606.18624#S2.F2), PragReST proceeds in two stages\. First, the model constructs a pragmatics training set by generating situations, questions, and target interpretations from domain seeds, few\-shot examples, and descriptions of pragmatic phenomena\. The same model audits these generated instances to remove low\-quality, ambiguous, or invalid examples\. Second, PragReST turns these filtered problems into training data for teaching models counterfactual pragmatic reasoning, following two training paradigms: supervised fine\-tuning \(SFT\) and reinforcement learning \(RL\)\. For SFT, we first generate privileged answer traces via a prompt that explicitly encourages the model to compare the observed utterance with plausible communicative alternatives, and then train the model on these traces with the prompt removed\. This teaches the model to internalize counterfactual reasoning\. In the RL stage, we further tune the model via GRPO \(Shao et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib8)\), training the model on filtered problems using a self\-judged correctness reward\.

We use PragReST to train two sizes of reasoning models, Qwen3\-8B and Qwen3\-14B \(Qwen Team, [2025](https://arxiv.org/html/2606.18624#bib.bib22)\)\. We evaluate across four pragmatic benchmarks: PragMega \(fine\-grained pragmatics QA; Hu et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib14)\), Ludwig \(implicature interpretation; Ruis et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib15)\), MetoQA \(metonymic reference resolution; Sravanthi et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib6)\), and AltPrag \(open\-ended pragmatic recovery; Yu et al\., [2026](https://arxiv.org/html/2606.18624#bib.bib40)\)\. Our central scientific question is *what kind of reasoning matters for pragmatic inference\.* To answer this, we compare PragReST against a non\-counterfactual variant that keeps the same data construction, filtering, and training recipe but replaces the counterfactual reasoning instructions with a generic pragmatic instruction\. This variant yields only limited average improvement over the instruct backbone\. In contrast, PragReST improves over the instruct backbone by an average of 5\.37% accuracy for Qwen3\-8B and 5\.50% for Qwen3\-14B across the three accuracy\-based benchmarks \(PragMega, Ludwig, MetoQA\)\. \(Unless noted otherwise, all accuracy differences are reported in terms of absolute percentages\.\) This contrast identifies counterfactual reasoning – rather than additional data, task exposure, or RL optimization – as the critical ingredient for pragmatic reasoning\. In [Section 5\.1](https://arxiv.org/html/2606.18624#S5.SS1), our error analysis shows that PragReST’s gains are concentrated in failure modes involving literal interpretation or missed communicative intent, where the correct answer depends on contrasting the observed utterance with plausible alternatives\. This also suggests that PragReST’s gains stem from the counterfactual reasoning\. In [Section 5\.2](https://arxiv.org/html/2606.18624#S5.SS2), we also confirm that these gains do not substantially degrade out\-of\-domain performance on general knowledge and mathematical reasoning tasks from MMLU\-Pro \(Wang et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib46)\), MATH\-500 \(Lightman et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib47); Hendrycks et al\., [2021](https://arxiv.org/html/2606.18624#bib.bib48)\), and AIME2025 \(OpenCompass, [2025](https://arxiv.org/html/2606.18624#bib.bib49)\)\.

## 2 Related Work

#### Pragmatics in Language Models\.

Pragmatic interpretation has been studied as inference over speaker intentions, contextual alternatives, and shared background assumptions\. A canonical formalization is the Rational Speech Acts \(RSA\) framework, which models pragmatic interpretation as probabilistic, recursive reasoning between speakers and listeners about why a speaker chose one utterance over another in a given context \(Frank and Goodman, [2012](https://arxiv.org/html/2606.18624#bib.bib21); Goodman and Frank, [2016](https://arxiv.org/html/2606.18624#bib.bib1)\)\. This perspective has also informed computational work on pragmatically informed interpretation beyond classical reference games \(Fried et al\., [2018](https://arxiv.org/html/2606.18624#bib.bib25); Vaduguru et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib26)\)\. Recent benchmarks show that even strong LLMs remain brittle on context\-dependent interpretation, including implicature, presupposition, deixis, metonymic reference, and open\-ended pragmatic recovery, with substantial and uneven gaps relative to humans across phenomena and settings \(Hu et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib14); Ruis et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib15); Sravanthi et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib6); Yu et al\., [2026](https://arxiv.org/html/2606.18624#bib.bib40); Ma et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib7); Fried et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib4)\)\. Recent training approaches improve pragmatic reasoning through task\-specific preference tuning or teacher\-generated thought supervision \(Wu et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib19); Sravanthi et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib45)\)\. These approaches demonstrate that pragmatic reasoning can benefit from specialized supervision, but they rely on external pragmatic data, preference annotations, or teacher\-generated rationales\. In contrast, PragReST self\-generates training data and uses counterfactual reasoning as the organizing principle for generating, filtering, and reinforcing reasoning traces within a self\-contained training loop\.

![Refer to caption](https://arxiv.org/html/2606.18624v1/x2.png)

Figure 2: A high\-level overview of PragReST\. The process starts with the self\-generation of domain seeds\. The QA task prompt is constructed by combining a general instruction, a domain seed, and a pragmatic taxonomy description\. The generated data are filtered by a self\-judge\. For SFT, gold answers are pre\-generated with a counterfactual \(CF\) instruction, which is removed from the final SFT input\. We also apply a second filtering step to remove examples for which the model returns an incorrect answer even with the pragmatic instruction\.

#### Self\-Improvement in Language Models\.

Prior work on LLM self\-improvement trains on model\-generated data, rationales, or rewards, but has focused largely on domains with deterministic tasks and verifiable rewards\. Early rationale\-bootstrapping methods train models on self\-generated reasoning that leads to correct answers, while later work scales this paradigm through expectation\-maximization style self\-training, reinforced fine\-tuning, consistency\-based rationale filtering, unsupervised self\-training, and autonomous curriculum generation \(Zelikman et al\., [2022](https://arxiv.org/html/2606.18624#bib.bib16), [2024](https://arxiv.org/html/2606.18624#bib.bib23); Singh et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib54); Trung et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib9); Lee et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib30); Xu et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib31); Sun et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib18); Zhao et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib32)\)\. Synthetic\-data methods such as Self\-Instruct and Evol\-Instruct further show that LLMs can expand their own training distributions rather than merely relabel fixed datasets \(Wang et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib33); Xu et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib34)\)\. These methods have primarily targeted math, code, or instruction\-following settings in which correctness can be checked by ground\-truth answers, executors, or verifier models\. PragReST extends this paradigm to open\-ended pragmatics, where both the task distribution and the correctness signal must be constructed: the model generates pragmatic QA instances, filters them with a self\-judge, and uses a constrained binary correctness judge rather than a general\-purpose quality or preference evaluator\. This connects PragReST to LLM\-based evaluators and self\-rewarding systems \(Liu et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib35); Kim et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib36); Yuan et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib24)\), while avoiding a fully general\-purpose judge formulation, which can be brittle across domains \(Zheng et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib37); Huang et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib38); Raina et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib39)\)\. PragReST is also related to self\-distillation with privileged information and context distillation, where a teacher policy is conditioned on information unavailable to the student at inference time \(Nguyen et al\., [2026](https://arxiv.org/html/2606.18624#bib.bib50); Penaloza et al\., [2026](https://arxiv.org/html/2606.18624#bib.bib51); Zhao et al\., [2026](https://arxiv.org/html/2606.18624#bib.bib53); Ye et al\., [2026](https://arxiv.org/html/2606.18624#bib.bib52)\)\. In PragReST, the privileged signal is a counterfactual reasoning script used only during SFT data generation; the model is then trained on the original problem without this script, with the goal of internalizing the reasoning procedure\.

## 3 Methodology

As illustrated in [Fig\. 2](https://arxiv.org/html/2606.18624#S2.F2), PragReST consists of two stages\. First, the model generates pragmatics\-focused QA problems from domain seeds and pragmatic phenomenon descriptions, then filters them with a self\-judge to obtain the primary problem data\. Second, the model learns from this data through counterfactual bootstrapping and GRPO: SFT distills reasoning traces generated under a privileged counterfactual script, and GRPO reinforces pragmatically correct answers\. All supervision signals are produced without human\-labeled pragmatic data or a stronger external model\.

### 3\.1 Data Generation

#### Prompt Construction\.

Our pipeline begins by generating a pool of short domain descriptors \(e\.g\., Healthy Meal Prep, Modern Travel Planner\)\. We then sample a pragmatic taxonomy category from the list introduced by Ma et al\. \([2025](https://arxiv.org/html/2606.18624#bib.bib7)\): Context and Deixis, Implicature and Presupposition, Speech Acts and Intent Recognition, Discourse and Coherence, Social Pragmatics, and Metaphor, as illustrated in Step 1 of [Fig\. 2](https://arxiv.org/html/2606.18624#S2.F2)\. For each domain–category pair, the model is prompted to generate an open\-ended QA item consisting of a pragmatic situation, a question grounded in that situation, and a target answer\. This differs from evaluation on fixed benchmark items\. Instead of recovering an implicit meaning from an externally provided context, the model constructs the scenario, question, and intended answer together under an explicit pragmatic category description\. The generated item can therefore be treated as a candidate locally grounded QA instance, closer to answer\-aware question generation and grounded synthetic QA generation than to ordinary benchmark inference \(Zhang et al\., [2022](https://arxiv.org/html/2606.18624#bib.bib56); Radevski et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib57)\)\. Because these candidates may still be noisy, we retain them only after filtering out examples that are malformed, ambiguous, answerable without pragmatic reasoning, or not pragmatically licensed\. This proposal\-and\-filtering strategy follows prior self\-generation and bootstrapping work, which uses model\-generated candidates for training only after automatic filtering or correctness checks \(Wang et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib33); Zelikman et al\., [2022](https://arxiv.org/html/2606.18624#bib.bib16)\)\. We expand on data generation details in Appendix [A](https://arxiv.org/html/2606.18624#A1)\.
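The prompt assembly described above can be sketched as follows\. This is only an illustration under stated assumptions: the category names and example seeds come from the text, but the instruction wording, the `build_generation_prompt` helper, and the placeholder category description are hypothetical\. The paper's actual prompts are given in its Appendix A and are not reproduced here\.

```python
import random

# Category names and example seeds are taken from the text; everything else
# (instruction wording, template layout) is a placeholder, not the paper's prompt.
CATEGORIES = [
    "Context and Deixis",
    "Implicature and Presupposition",
    "Speech Acts and Intent Recognition",
    "Discourse and Coherence",
    "Social Pragmatics",
    "Metaphor",
]
DOMAIN_SEEDS = ["Healthy Meal Prep", "Modern Travel Planner"]

GENERAL_INSTRUCTION = (
    "Write one open-ended QA item with a pragmatic situation, "
    "a question grounded in that situation, and a target answer."
)


def build_generation_prompt(domain_seed, category, category_description):
    """Combine general instruction, domain seed, and taxonomy description."""
    return "\n".join([
        GENERAL_INSTRUCTION,
        f"Domain: {domain_seed}",
        f"Pragmatic category: {category}",
        f"Category description: {category_description}",
    ])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
seed = rng.choice(DOMAIN_SEEDS)
category = rng.choice(CATEGORIES)
prompt = build_generation_prompt(seed, category, "<description of the phenomenon>")
print(prompt)
```

Keeping the three ingredients as separate arguments makes it easy to swap the taxonomy description while holding the domain seed fixed, which is the kind of domain–category pairing the paragraph describes\.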

#### Self\-Filtering\.

Using the prompts constructed in Step 1, the model generates candidate QA and dialogue instances\. Because self\-generated data can be noisy, we apply a filtering stage before training\. For each generated instance, we use the corresponding instruct backbone as a constrained binary judge\. The judge is prompted to output either *yes* or *no*, indicating whether the instance meets a set of manually defined quality criteria \(provided in Appendix [A\.3](https://arxiv.org/html/2606.18624#A1.SS3)\)\. To obtain a continuous judge signal, we derive a first\-token confidence margin $m(q) = p(\texttt{yes} \mid q) - p(\texttt{no} \mid q)$, where $q$ is the generated QA item\. We interpret lower scores as lower\-confidence generations and discard the bottom 50% of the data, ranked by margin\. The resulting dataset from this stage is referred to as the primary problem data\. To give a sense of typical scale, our Qwen3\-14B run starts from 1,000 domain seeds and yields 6,000 parseable QA items, of which 3,000 are retained after self\-filtering as the primary problem data\. For SFT target construction, 2,816 counterfactual responses pass the correctness filter, yielding 2,759 training examples and 57 held\-out synthetic validation examples\. A human\-agreement calibration study is reported in Appendix [B\.3](https://arxiv.org/html/2606.18624#A2.SS3)\.
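The margin\-based filter can be illustrated with a short sketch\. The judge probabilities below are made\-up values, the helper names are hypothetical, and the choice to round down when the number of items is odd is an assumption; only the margin definition and the "discard the bottom 50%" rule come from the text\.

```python
def confidence_margin(p_yes, p_no):
    """First-token margin m(q) = p(yes | q) - p(no | q)."""
    return p_yes - p_no


def keep_top_half(items):
    """Rank items by margin (highest first) and discard the bottom 50%.

    With an odd number of items this keeps the smaller half (assumption).
    """
    ranked = sorted(
        items,
        key=lambda it: confidence_margin(it["p_yes"], it["p_no"]),
        reverse=True,
    )
    return ranked[: len(ranked) // 2]


# Hypothetical first-token probabilities from the self-judge.
candidates = [
    {"id": "q1", "p_yes": 0.90, "p_no": 0.05},
    {"id": "q2", "p_yes": 0.40, "p_no": 0.55},
    {"id": "q3", "p_yes": 0.70, "p_no": 0.20},
    {"id": "q4", "p_yes": 0.55, "p_no": 0.40},
]
kept = keep_top_half(candidates)
print([it["id"] for it in kept])
```

Here `q1` and `q3` have the two largest margins and are retained\. Ranking by margin rather than thresholding the yes/no decision means the retained fraction is fixed at half of the pool regardless of how lenient the judge is overall\.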

### 3\.2 Training

We train the model in two stages\. The first stage uses supervised fine\-tuning \(SFT\) to internalize counterfactual pragmatic reasoning from generated demonstrations, and the second stage uses GRPO to reinforce pragmatically correct outcomes\. We describe the main design choices below and provide full training details in Appendix [B](https://arxiv.org/html/2606.18624#A2)\.

#### Supervised Fine\-Tuning with Counterfactual Bootstrapping\.

The goal of the SFT stage is to teach the model a reusable counterfactual reasoning procedure\. For each filtered training instance, we construct an augmented target\-generation prompt by prepending the counterfactual reasoning script and the corresponding pragmatic section description to the original problem\. Under this augmented prompt, the model generates candidate responses consisting of a reasoning trace and a final answer; the script encourages the model to interpret an utterance as a communicative choice, compare it with plausible alternatives, and infer the speaker’s intended meaning from that contrast \(Goodman and Frank, [2016](https://arxiv.org/html/2606.18624#bib.bib1); Fried et al\., [2018](https://arxiv.org/html/2606.18624#bib.bib25); Vaduguru et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib26); Tsvilodub et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib27)\)\. We then apply the self\-judge to retain only candidate responses whose final answer is judged pragmatically correct with respect to the context, question, and reference answer\. After filtering, each accepted response is paired with the unaugmented original problem\. Both the counterfactual script and the pragmatic section description are removed from the student input\. Thus, the script acts as privileged scaffolding for constructing SFT targets, not as an inference\-time prompt, and training distills the resulting counterfactual reasoning behavior into the model so that it can recover the same reasoning pattern from the original input alone\. Details on SFT data construction and hyperparameters are given in Appendix [B\.4](https://arxiv.org/html/2606.18624#A2.SS4)\.
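The construction of SFT pairs can be sketched as below\. This is a simplified illustration, not the authors' implementation: `build_sft_examples`, the stub generator, the stub judge, and the field names are all hypothetical, and the sketch draws a single candidate response per problem for brevity\.

```python
def build_sft_examples(problems, cf_script, generate, judge_correct):
    """Generate targets under a privileged prompt, then strip the scaffold."""
    examples = []
    for prob in problems:
        # Privileged prompt: script + section description + original problem.
        augmented = "\n\n".join(
            [cf_script, prob["section_description"], prob["prompt"]]
        )
        response = generate(augmented)  # reasoning trace + final answer
        if judge_correct(prob, response):
            # The student sees only the original, unaugmented problem.
            examples.append({"input": prob["prompt"], "target": response})
    return examples


# Stand-ins for the model and the self-judge (hypothetical, for illustration).
def stub_generate(prompt):
    return "Reasoning: compare with alternatives. Answer: she is annoyed"


def stub_judge(problem, response):
    return problem["reference"] in response


problems = [
    {"prompt": "P1", "section_description": "D1", "reference": "she is annoyed"},
    {"prompt": "P2", "section_description": "D2", "reference": "he agrees"},
]
examples = build_sft_examples(
    problems, "COUNTERFACTUAL SCRIPT", stub_generate, stub_judge
)
print(examples)
```

Only the first problem survives the correctness filter, and its training input contains neither the script nor the section description\. This mirrors the point made above that the script is scaffolding for target construction rather than an inference\-time prompt\.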

#### Reinforcement Learning with GRPO\.

After SFT, the model has learned from counterfactual reasoning traces, but this stage uses correctness only as a criterion for selecting demonstrations rather than as an objective optimized during training\. We therefore initialize from the SFT checkpoint and apply GRPO \(Shao et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib8)\) on the filtered primary problem data\. Before GRPO training, we apply an offline difficulty filter, following DAPO \(Yu et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib29)\)\. For each prompt, we sample $G$ rollouts from the initial SFT model and discard *easy* prompts on which every rollout already passes the correctness judge, since these zero\-variance groups provide no learning signal\. Note that we keep prompts for which the SFT model gets zero reward, as these may still yield some signal later from GRPO\-trained checkpoints\. During GRPO training, we keep the filtered prompt set fixed but resample rollouts online as the model is updated\. For each remaining prompt, the current policy draws a fresh group of $G$ candidate responses online, which are scored using a composite reward that combines output\-format compliance with pragmatic answer correctness\. The correctness component reuses the same correctness judge and first\-token margin construction defined in Appendix [B\.2](https://arxiv.org/html/2606.18624#A2.SS2), comparing the extracted final answer against the reference answer\. This training reinforces responses that recover the intended pragmatic interpretation while regularizing the policy toward the SFT checkpoint\. Full GRPO details and hyperparameters are provided in Appendix [B\.6](https://arxiv.org/html/2606.18624#A2.SS6)\.
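A small sketch may help show why all\-pass groups carry no learning signal\. It uses the standard group\-relative advantage from GRPO \(reward minus group mean, divided by group standard deviation\)\. The reward weights, the use of the population standard deviation, the `eps` constant, and all function names are assumptions made for illustration; the paper's exact reward and hyperparameters are in its Appendix B\.

```python
import statistics


def composite_reward(format_ok, correctness, w_format=0.1, w_correct=1.0):
    """Format compliance plus answer correctness (weights are assumed)."""
    return w_format * float(format_ok) + w_correct * float(correctness)


def offline_difficulty_filter(prompt_to_passes):
    """Drop 'easy' prompts where every rollout passes; keep all-fail prompts."""
    return [p for p, passes in prompt_to_passes.items() if not all(passes)]


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical pass/fail outcomes for G = 4 rollouts per prompt.
rollouts = {
    "easy": [True, True, True, True],
    "hard": [False, False, False, False],
    "mixed": [True, False, True, False],
}
kept_prompts = offline_difficulty_filter(rollouts)
print(kept_prompts)
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # zero-variance group
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed group
```

When every rollout in a group receives the same reward, each advantage is zero and the policy gradient for that prompt vanishes, which is the reason for removing easy prompts up front\. The all\-fail prompt is kept, matching the text, because later checkpoints may start to solve it\.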

Table 1: Performance across benchmarks and models under greedy decoding\. PragMega, Ludwig, and MetoQA report accuracy; AltPrag reports its reference\-based score\. Values are point estimates with bootstrap standard errors over examples\. We bold the best point estimate for each model size and benchmark\.

## 4 Experiments and Results

#### Benchmarks and Models\.

We evaluate our method on four benchmarks selected to directly test pragmatic interpretation: PragMega, Ludwig, MetoQA, and AltPrag\. PragMega \(Hu et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib14)\) is a QA benchmark for pragmatic language understanding spanning multiple pragmatic phenomena\. Ludwig \(Ruis et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib15)\) evaluates implicature interpretation as a binary decision over whether a listener’s response should be interpreted as yes or no\. MetoQA \(Sravanthi et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib6)\) evaluates metonymic reference resolution, where the model must infer the intended referent behind a contextually associated expression\. AltPrag \(Yu et al\., [2026](https://arxiv.org/html/2606.18624#bib.bib40)\) evaluates open\-ended pragmatic recovery, where the model must produce an appropriate interpretation of the implied meaning\. Representative examples from each benchmark are provided in [Table 9](https://arxiv.org/html/2606.18624#A3.T9)\. For PragMega, Ludwig, and MetoQA, we report accuracy; for AltPrag, we report the benchmark’s reference\-based evaluation score and pairwise comparisons using the original GPT\-4\.1 judge and scoring protocol released by the benchmark authors\. We conduct experiments with two instruct backbone models: Qwen3\-8B and Qwen3\-14B\. We apply the same training pipeline to each model and evaluate the resulting models under the same benchmark settings\.

#### Baselines\.

We compare PragReST against four baselines\. The first is the instruct backbone\. The second is Deep\-layer\-DPO, following Wu et al\. \([2024](https://arxiv.org/html/2606.18624#bib.bib19)\), a layer\-restricted DPO method designed for social and pragmatic inference\. For this baseline, we follow the strongest setting reported in Wu et al\. \([2024](https://arxiv.org/html/2606.18624#bib.bib19)\) and train on SocialIQA \(Sap et al\., [2019](https://arxiv.org/html/2606.18624#bib.bib13)\), a human\-annotated social inference dataset\. To adapt this baseline to reasoning models, we generate gold\-guided reasoning traces by providing the correct answer during trace generation and train on the resulting reasoning\-answer outputs\. The third is IMP\-SFT, following Sravanthi et al\. \([2025](https://arxiv.org/html/2606.18624#bib.bib45)\), which distills GPT\-4o\-mini\-generated rationales for pragmatic understanding\. For this baseline, we preserve the model’s original reasoning behavior by masking the loss on the model\-generated reasoning part, then appending the GPT\-4o\-mini rationale and training on the rationale paired with the correct label\. Deep\-layer\-DPO and IMP\-SFT represent two recent training\-based approaches to pragmatic improvement that rely on external supervision: human\-annotated social inference data and teacher\-rationale distillation, respectively\. Finally, we include non\-counterfactual variants of our own pipeline, which keep the same data generation, filtering, SFT, and GRPO stages as PragReST but replace the counterfactual reasoning instruction with a generic pragmatic reasoning instruction\. This isolates the effect of counterfactual reasoning from the effect of self\-training and RL post\-training\. Counterfactual and non\-counterfactual prompts are in Appendix [B\.1](https://arxiv.org/html/2606.18624#A2.SS1)\.

### 4\.1 Results

Table [1](https://arxiv.org/html/2606.18624#S3.T1) shows that PragReST consistently improves pragmatic reasoning across the four evaluation benchmarks and both model sizes\. For Qwen3\-8B, PragReST\-GRPO achieves the best result in every benchmark, improving over the instruct backbone on PragMega, Ludwig, MetoQA, and AltPrag\. Across the three accuracy\-based benchmarks, this corresponds to an average gain of 5\.37% over the instruct backbone; on AltPrag, which uses a reference\-based score, PragReST\-GRPO improves from 7\.24 to 7\.62\. For Qwen3\-14B, PragReST\-GRPO again gives the best performance across all reported benchmarks, with an average gain of 5\.50% across the three accuracy\-based benchmarks and an AltPrag improvement from 7\.78 to 8\.14\. These results suggest that PragReST improves multiple forms of pragmatic interpretation, including fine\-grained QA, implicature resolution, metonymic reference, and open\-ended pragmatic recovery\.

#### Comparison with Human Performance

To contextualize the remaining headroom on the accuracy\-based pragmatic benchmarks, we compare PragReST\-GRPO with the human\-performance estimates reported or computable from the corresponding benchmark resources\. We restrict this comparison to PragMega, Ludwig, and MetoQA, since these benchmarks are evaluated with accuracy\. We do not include AltPrag in this table because it uses a reference\-based open\-ended scoring protocol rather than a directly comparable human accuracy score\. As shown in Table [2](https://arxiv.org/html/2606.18624#S4.T2), Qwen3\-14B PragReST\-GRPO reaches performance close to the human estimates available on all three accuracy\-based benchmarks\. On PragMega, the human score computed using the code of the benchmark authors over our evaluated subset is 86\.37, compared to 85\.80 for PragReST\-GRPO\. On Ludwig, PragReST\-GRPO reaches 86\.50, slightly above the reported human average of 86\.2\. On MetoQA, PragReST\-GRPO reaches 80\.31, close to the reported human score of 80\.0\. These comparisons suggest that, for the accuracy\-based benchmarks, PragReST operates in a regime of near\-human performance, which may partly explain why absolute gains over strong instruction\-tuned backbones are modest on some tasks\.
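As a quick arithmetic check on the figures quoted above, the snippet below computes the gap between Qwen3\-14B PragReST\-GRPO and the human estimate on each benchmark\. The numbers are copied from the paragraph; the dictionary layout and variable names are only for illustration\.

```python
# Scores quoted in the text (model = Qwen3-14B PragReST-GRPO).
SCORES = {
    "PragMega": {"model": 85.80, "human": 86.37},
    "Ludwig": {"model": 86.50, "human": 86.2},
    "MetoQA": {"model": 80.31, "human": 80.0},
}

# Positive gap: model above the human estimate; negative: below it.
gaps = {
    name: round(s["model"] - s["human"], 2) for name, s in SCORES.items()
}
print(gaps)
```

The gaps come out to \-0\.57 on PragMega, \+0\.30 on Ludwig, and \+0\.31 on MetoQA, so the model is within about 0\.6 points of the human estimate on all three benchmarks\.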

Table 2: Comparison between Qwen3\-14B PragReST and human\-performance estimates on the three accuracy\-based pragmatic benchmarks\. The PragMega human score is computed using the benchmark authors’ data\. The Ludwig and MetoQA human scores are those reported in the corresponding papers\.

#### Comparison with External Baselines\.

PragReST also compares favorably against prior task\-specific pragmatic tuning methods\. For Qwen3\-8B, Deep\-layer\-DPO underperforms the instruct backbone on PragMega, Ludwig, and MetoQA\. Meanwhile, IMP\-SFT improves over the instruct model on Ludwig because its training data contains an augmented version of Ludwig, but this improvement does not transfer to the other benchmarks\. Overall, these comparisons underscore that PragReST improves over prior pragmatic supervision methods despite learning from a self\-generated signal rather than relying on human\-annotated data or a stronger teacher\.

![Refer to caption](https://arxiv.org/html/2606.18624v1/x3.png) Figure 3: Pairwise preference on AltPrag: each bar compares PragReST\-GRPO against a Qwen3\-14B baseline, using GPT\-4\.1 as a blind pairwise judge over the two models’ generated answers\.

#### Preference Evaluation on AltPrag\.

Because AltPrag requires open\-ended pragmatic recovery, we complement its reference\-based score with a blind GPT\-4\.1 pairwise comparison over model outputs, following the benchmark’s original evaluation setup\. As shown in Figure [3](https://arxiv.org/html/2606.18624#S4.F3), PragReST\-GRPO is preferred over the instruct backbone in 67\.41% of decided comparisons and over PragReST\-SFT in 53\.85%; it is also preferred over the external baselines and the non\-counterfactual variants\. This indicates that the AltPrag improvement is not only a scalar\-score shift: when full interpretations are compared directly, PragReST\-GRPO is more often judged to recover the intended pragmatic meaning\.
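
A preference rate over "decided comparisons" excludes ties from the denominator. The following sketch shows that computation under the assumption that each judge verdict is recorded as `"A"`, `"B"`, or `"tie"`; this encoding and the function name `decided_win_rate` are illustrative choices, not the paper's evaluation code.

```python
from collections import Counter


def decided_win_rate(verdicts, model="A"):
    """Share of non-tie comparisons in which `model` was preferred."""
    counts = Counter(verdicts)
    decided = counts["A"] + counts["B"]  # ties are left out of the denominator
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts[model] / decided


# Hypothetical verdicts for illustration only.
verdicts = ["A", "B", "tie", "A", "A", "tie", "B", "A"]
rate = decided_win_rate(verdicts)
print(f"A preferred in {100 * rate:.2f}% of decided comparisons")
```

Because ties are dropped, the rates for the two models always sum to one, so a value above 50% means the model is preferred more often than its opponent.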

#### Importance of Counterfactual Supervision\.

The non\-counterfactual variants provide a controlled test of whether the gains come from self\-training alone or from the counterfactual structure of the supervision\. They use the same generated problem distribution, filtering procedure, SFT stage, and GRPO stage as PragReST, but ask the model to reason about the pragmatic meaning of the utterance in context, without providing the explicit counterfactual scaffold used in PragReST\. Their weaker performance shows that additional pragmatic\-domain self\-training is not sufficient by itself: the self\-improvement loop becomes effective when the training signal teaches the model to contrast the observed utterance with plausible communicative alternatives\. At the same time, PragReST\-GRPO improves over PragReST\-SFT on the primary comparison, indicating that outcome\-based reinforcement adds gains beyond imitation of counterfactual traces\. Taken together, these results suggest that the two stages play complementary roles: SFT gives the model a counterfactual reasoning procedure, while GRPO reinforces when and how to apply that procedure to recover pragmatically correct interpretations\.
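
To make the "outcome-based reinforcement" step concrete, the sketch below computes group-relative advantages as in the standard GRPO formulation (Shao et al., 2024): each sampled response is scored against the mean and standard deviation of its own group. The binary correctness reward, the population standard deviation, and the `eps` constant are assumptions made for illustration; the section does not specify the exact reward or normalization.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

When every response in a group receives the same reward, all advantages are zero, so such groups contribute no learning signal.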

Table 3: Mean and standard deviation across three independent Qwen3\-14B training runs\. All evaluations are under greedy decoding\. The Instruct row is the fixed, untrained base model\. Under greedy decoding, its score is deterministic and it is only evaluated once, while PragReST\-SFT and PragReST\-GRPO vary across the three independent training runs\.

Table 4: Performance of PragReST on Gemma\-4\-E4B and GPT\-OSS\-20B\. Values are point estimates with bootstrap standard errors over examples\.

#### Robustness and Generalization\.

We further check whether PragReST depends on a single favorable run or on the Qwen3 model family\. Across three independent Qwen3\-14B runs with different seeds for data generation, training, and sampling, both PragReST\-SFT and PragReST\-GRPO remain above the instruct backbone on average, with modest run\-to\-run variation \(Table [3](https://arxiv.org/html/2606.18624#S4.T3)\)\. We also evaluate PragReST on two additional backbones, Gemma\-4\-E4B and GPT\-OSS\-20B\. For GPT\-OSS\-20B, due to its larger model size, we set the reasoning\-effort parameter to low and train LoRA adapters while keeping the same data generation procedure and training objectives\. As shown in Table [4](https://arxiv.org/html/2606.18624#S4.T4), the same overall pattern holds across both models: counterfactual SFT improves over the base model, and GRPO generally provides further gains\. On Gemma\-4\-E4B, PragReST\-GRPO improves over the Instruct backbone by an average of 5\.28% across the three accuracy\-based benchmarks and raises the AltPrag score from 7\.39 to 7\.72\. On GPT\-OSS\-20B, PragReST\-GRPO improves over the Base model by an average of 6\.72% across the three accuracy\-based benchmarks and raises the AltPrag score from 7\.41 to 7\.47\. These additional runs suggest that the gains are not driven by one Qwen3 training run or by the Qwen3 architecture alone\.
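
The two uncertainty summaries used in this subsection, a mean and standard deviation across runs and a bootstrap standard error over examples, can be sketched as follows. The sample standard deviation, the number of resamples, and all input values are assumptions for illustration; they are not taken from Tables 3 or 4.

```python
import random
from statistics import mean, stdev


def run_summary(scores):
    """Mean and sample standard deviation across independent training runs."""
    return mean(scores), stdev(scores)


def bootstrap_standard_error(correct, n_resamples=1000, seed=0):
    """Bootstrap standard error of accuracy (in %), resampling examples with replacement."""
    rng = random.Random(seed)
    n = len(correct)
    estimates = []
    for _ in range(n_resamples):
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        estimates.append(100.0 * sum(resample) / n)
    return stdev(estimates)


# Hypothetical accuracies from three runs, and per-example correctness flags.
run_mean, run_std = run_summary([80.0, 82.0, 84.0])
correct = [1] * 80 + [0] * 20
standard_error = bootstrap_standard_error(correct)
print(run_mean, run_std, round(standard_error, 2))
```

The two quantities answer different questions: the run-level standard deviation reflects variation from seeds for data generation, training, and sampling, while the bootstrap standard error reflects the finite size of the evaluation set.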

## 5Discussion and Analysis

Our results in [Table 1](https://arxiv.org/html/2606.18624#S3.T1) and [Fig\. 3](https://arxiv.org/html/2606.18624#S4.F3) show that PragReST improves performance across multiple pragmatic reasoning tasks and that these gains are largest when training includes counterfactual reasoning over communicative alternatives\. More broadly, we argue that this suggests a critical relationship between counterfactual reasoning and pragmatics: we only see improvements when this relationship is encoded in PragReST\. To that end, we analyze where PragReST’s gains stem from\. Additionally, we show that training models for pragmatic reasoning still preserves their broader reasoning and knowledge capabilities\.

### 5\.1Counterfactual Reasoning and Error Reduction

We test whether the accuracy gains indeed arise from the counterfactual mechanism, as hypothesized\. If PragReST improves pragmatic reasoning via counterfactual reasoning, its gains should not be distributed uniformly across all mistakes\. Instead, the largest error reductions should occur for error types that involve a failure to compare the observed utterance with plausible alternatives\. We therefore analyze PragMega errors before and after PragReST, induce a taxonomy of recurring failure modes, validate the annotations against human labels, and relate each error type to the amount of counterfactual reasoning observed\. See Appendix [D](https://arxiv.org/html/2606.18624#A4) for further details on taxonomy construction and validation\.

#### Inducing an Error Taxonomy\.

To systematically analyze the models’ shortcomings, we construct a diagnostic taxonomy of pragmatic reasoning failures from incorrect PragMega outputs\. First, we collect failure cases from the evaluated Qwen3 models\. Each case includes the original prompt, answer options, gold answer, model prediction, phenomenon type, and an excerpt of the model output\. We then split these cases into batches and prompt an LLM \(GPT\-4\.1\-mini\) to propose recurring error categories for each batch, without assigning labels to individual examples\. The prompt asks for categories that explain the underlying pragmatic reasoning failure and that generalize across benchmark phenomena\. Next, we run a second LLM consolidation step over the batch\-level taxonomies to produce a compact set of reusable error types\. We then fix the final taxonomy used in analysis to five categories: *literal/surface bias, missed communicative intent, unsupported or overextended inference, coherence\-bridge error*, and *figurative or humor mapping error*\. Definitions of each can be found in Appendix [D\.1](https://arxiv.org/html/2606.18624#A4.SS1.SSS0.Px2)\. After fixing the taxonomy, an LLM annotates all failure cases using these categories, with potentially multiple labels per example\.
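
The three stages described above (batch-level category proposal, consolidation, and multi-label annotation) can be outlined as a small pipeline. This is a structural sketch only: the callables `propose`, `consolidate`, and `annotate` stand in for LLM prompts and are replaced here by deterministic stubs, and the batch size and case format are hypothetical.

```python
def make_batches(cases, size):
    """Split failure cases into consecutive batches of at most `size` items."""
    return [list(cases[i:i + size]) for i in range(0, len(cases), size)]


def induce_taxonomy(cases, propose, consolidate, batch_size=20):
    """Stage 1: propose categories per batch. Stage 2: consolidate them."""
    proposals = [propose(batch) for batch in make_batches(cases, batch_size)]
    return consolidate(proposals)


def annotate_cases(cases, taxonomy, annotate):
    """Stage 3: multi-label annotation, keeping only labels from the fixed taxonomy."""
    allowed = set(taxonomy)
    return [[tag for tag in annotate(case, taxonomy) if tag in allowed]
            for case in cases]


# Deterministic stubs standing in for LLM calls (hypothetical behavior).
def fake_propose(batch):
    return sorted({case["hint"] for case in batch})


def fake_consolidate(proposals):
    merged = []
    for proposal in proposals:
        for category in proposal:
            if category not in merged:
                merged.append(category)
    return merged


def fake_annotate(case, taxonomy):
    return [case["hint"], "not-in-taxonomy"]


cases = [
    {"hint": "literal/surface bias"},
    {"hint": "missed communicative intent"},
    {"hint": "literal/surface bias"},
]
taxonomy = induce_taxonomy(cases, fake_propose, fake_consolidate, batch_size=2)
labels = annotate_cases(cases, taxonomy, fake_annotate)
print(taxonomy, labels)
```

Filtering annotations against the fixed taxonomy reflects the order of operations in the text: categories are frozen before any individual example is labeled.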

#### Counterfactual Reasoning Score\.

In addition to error tags, each reasoning trace is automatically scored for the presence of counterfactual pragmatic reasoning\. We prompt an LLM judge to flag whether the trace considers relevant alternative utterances or interpretations, contrasts literal and intended meanings, identifies mismatches between what was said and what would have been said under a literal interpretation, and uses the speaker’s communicative choice to infer intent\. A higher score means more counterfactual reasoning\.
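
One simple way to turn the four judge flags listed above into a score is to count how many are set. The sketch below adopts that reading as an assumption; the field names and the count-based aggregation are hypothetical, and the scoring rubric actually used may differ.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CounterfactualFlags:
    """Judge flags for one reasoning trace (field names are illustrative)."""
    considers_alternatives: bool          # alternative utterances or interpretations
    contrasts_literal_and_intended: bool  # literal versus intended meaning
    identifies_literal_mismatch: bool     # what was said versus a literal reading
    uses_speaker_choice: bool             # speaker's choice used to infer intent


def counterfactual_score(flags):
    """Number of flags set; higher means more counterfactual reasoning."""
    return sum(bool(getattr(flags, f.name)) for f in fields(flags))


example = CounterfactualFlags(True, True, False, True)
print(counterfactual_score(example))
```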

#### Validating Automatic Annotation\.

Because full manual annotation is costly, we use GPT\-4\.1\-mini labels for the full diagnostic analysis\. We validate this choice with a blind agreement study on a shared subset of 40 error samples, labeled by two project annotators and one additional annotator who was not involved in the project\. Human–human agreement is 83\.8% with an average Micro Cohen’s κ of 0\.628, while human–GPT agreement is 82\.6% with an average Micro Cohen’s κ of 0\.614\. We report the full agreement analysis in Appendix [D\.2](https://arxiv.org/html/2606.18624#A4.SS2.SSS0.Px2)\.
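
For multi-label annotations, one plausible reading of "Micro" agreement is to pool every (example, category) pair into a single list of binary decisions and then compute percent agreement and Cohen's κ on the pooled list. The sketch below follows that reading as an assumption; the label names and annotations are invented for illustration, and Appendix D.2 remains the reference for the exact procedure.

```python
def binary_decisions(annotations, labels):
    """Flatten multi-label annotations into one 0/1 decision per (example, label)."""
    return [int(label in tags) for tags in annotations for label in labels]


def percent_agreement(a, b):
    """Share of pooled decisions on which two annotators agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two binary decision lists of equal length."""
    n = len(a)
    observed = percent_agreement(a, b)
    rate_a, rate_b = sum(a) / n, sum(b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # both annotators are constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical labels and annotations for four examples.
labels = ["L1", "L2", "L3"]
annotator_1 = [{"L1"}, {"L2"}, {"L1", "L3"}, {"L3"}]
annotator_2 = [{"L1"}, {"L2", "L3"}, {"L1"}, {"L3"}]

a = binary_decisions(annotator_1, labels)
b = binary_decisions(annotator_2, labels)
print(round(percent_agreement(a, b), 3), round(cohens_kappa(a, b), 3))
```

κ is lower than raw agreement because it discounts the agreement expected by chance, which is substantial when most (example, category) decisions are negative.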

![Refer to caption](https://arxiv.org/html/2606.18624v1/x4.png) Figure 4: Tagged error counts before and after PragReST training\. Bars show the number of incorrect PragMega examples assigned to each error category for the Instruct model and PragReST\.

![Refer to caption](https://arxiv.org/html/2606.18624v1/x5.png) Figure 5: Error reduction by failure mode\. The x\-axis shows the change in mean counterfactual\-reasoning score from the Instruct model to PragReST, and the y\-axis shows the reduction in tagged errors\. The numbers inside indicate how many errors are associated with each phenomenon covered in PragMega\.

#### Error Reduction Aligns with Counterfactual Reasoning\.

As shown in [Fig\. 4](https://arxiv.org/html/2606.18624#S5.F4), PragReST reduces the dominant counterfactual\-pragmatic failure modes: missed communicative intent drops from 40 to 22, literal/surface bias from 30 to 15, and figurative/humor mapping from 8 to 4\. These categories require the model to move beyond literal compatibility and infer why the speaker chose the observed utterance rather than a more direct alternative\. The same pattern appears in [Fig\. 5](https://arxiv.org/html/2606.18624#S5.F5): error reductions correlate with increases in counterfactual reasoning, suggesting that the gains are tied to more explicit reasoning over communicative alternatives\. At the same time, unsupported\-inference and coherence\-bridge errors do not decrease, suggesting that counterfactual reasoning alone is not sufficient when the main challenge is determining whether an inferred alternative is supported by the discourse context\.
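
The before/after counts quoted above translate directly into absolute and relative reductions. The short sketch below performs that arithmetic on the three categories whose counts are stated in the text; the helper name `reduction` is illustrative.

```python
# (errors before, errors after) for the categories whose counts are given in the text.
BEFORE_AFTER = {
    "missed communicative intent": (40, 22),
    "literal/surface bias": (30, 15),
    "figurative/humor mapping": (8, 4),
}


def reduction(before, after):
    """Absolute reduction in tagged errors and the same reduction as a fraction."""
    absolute = before - after
    return absolute, absolute / before


for category, (before, after) in BEFORE_AFTER.items():
    absolute, relative = reduction(before, after)
    print(f"{category}: -{absolute} errors ({100 * relative:.0f}% fewer)")
```

In relative terms the three categories shrink by 45% to 50%, while missed communicative intent accounts for the largest absolute drop.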

### 5\.2Out\-of\-Domain Evaluation

Table 5: Out\-of\-domain evaluation for Qwen3\-8B and Qwen3\-14B models\. MMLU\-Pro accuracy is computed on a 10% subset sampled from each subject\.

A common concern with task\-specific post\-training is that improvements on the target domain may come at the cost of broader model capability\. We therefore evaluate whether PragReST preserves out\-of\-domain performance on general knowledge, mathematical reasoning, and factual truthfulness tasks\. Specifically, we evaluate on MMLU\-Pro \(Wang et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib46)\) using a 10% subset sampled from each subject, on MATH\-500 \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.18624#bib.bib48); Lightman et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib47)\) and AIME2025 \(OpenCompass, [2025](https://arxiv.org/html/2606.18624#bib.bib49)\), which test multi\-step mathematical reasoning, and on TruthfulQA \(Lin et al\., [2022](https://arxiv.org/html/2606.18624#bib.bib28)\), which measures factual truthfulness\. For MMLU\-Pro, we report accuracy; in math domains, we report pass@8; for TruthfulQA, we report the MC2 score\. We compare PragReST against the original instruct backbone in Table [5](https://arxiv.org/html/2606.18624#S5.T5)\. Across MMLU\-Pro, MATH\-500, AIME2025, and TruthfulQA, performance stays close to the instruct baseline and does not show a consistent downward trend across model sizes or task types\. These results suggest that PragReST improves pragmatic reasoning without a systematic loss in out\-of\-domain knowledge, mathematical reasoning ability, or factual truthfulness\.
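
Two pieces of this protocol lend themselves to a short sketch: drawing a 10% subset from each subject, and computing pass@k. The text does not say how pass@8 is estimated, so the code uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) as an assumption; with n = k = 8 it reduces to "at least one of eight samples is correct". The example data, the `subject` field, and the rounding rule for subset sizes are hypothetical.

```python
import random
from collections import defaultdict
from math import comb


def stratified_subset(examples, fraction=0.10, seed=0):
    """Sample the same fraction from every subject (at least one example each)."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for example in examples:
        by_subject[example["subject"]].append(example)
    subset = []
    for subject in sorted(by_subject):
        group = by_subject[subject]
        size = max(1, round(fraction * len(group)))
        subset.extend(rng.sample(group, size))
    return subset


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical pool: 30 math questions and 20 law questions.
examples = [{"subject": "math", "id": i} for i in range(30)]
examples += [{"subject": "law", "id": i} for i in range(20)]
subset = stratified_subset(examples)
print(len(subset), round(pass_at_k(16, 2, 8), 4))
```

Sampling per subject keeps the subject mix of the subset aligned with the full benchmark, which a single uniform 10% draw would only match in expectation.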

## 6Conclusion

We introduced PragReST, a self\-reinforcing framework for improving pragmatic reasoning through self\-generated counterfactual supervision, without human\-labeled pragmatic data or stronger teacher models\. Across pragmatic benchmarks, PragReST improves over backbone models, prior pragmatic tuning baselines, and non\-counterfactual variants, with analyses showing that the gains concentrate in cases requiring comparison between what a speaker said and what they could have said under alternative intentions\. These results suggest that reinforcement\-based self\-improvement can extend beyond formally verifiable domains toward socially grounded language understanding\.

## Limitations

Although PragReST improves pragmatic reasoning, some limitations remain\. First, PragReST does not uniformly reduce all types of pragmatic errors\. In our error analysis, literal/surface\-bias and missed\-intent errors decrease substantially, while unsupported\-inference and coherence\-bridge errors are not consistently reduced\. These remaining errors suggest that generating plausible communicative alternatives is not sufficient on its own: the model must also determine whether those alternatives are supported by the specific discourse context\. Future work could therefore incorporate stronger evidence\-checking mechanisms when constructing or using counterfactual alternatives\.

Second, our evaluation is limited to English\-language pragmatic benchmarks\. This follows the available benchmark setting and allows controlled comparison with prior work, but pragmatic interpretation is strongly shaped by language, culture, social norms, and conversational conventions \(Fried et al\., [2023](https://arxiv.org/html/2606.18624#bib.bib4); Ma et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib7)\)\. As a result, our findings do not establish that the same counterfactual training procedure transfers to multilingual or culturally variable pragmatic settings\. Extending PragReST to non\-English and cross\-cultural pragmatics is an important direction for future work\.

## Acknowledgments

We would like to thank Jessy Li for her helpful feedback, and Ananya Sahu for providing annotations\.

## References

- K\. Anuranjana, S\. Mallepally, S\. Mareddy, A\. Shukla, and R\. Mamidi \(2024\) Survey on Computational Approaches to Implicature\. In Proceedings of the 21st International Conference on Natural Language Processing \(ICON\), S\. Lalitha Devi and K\. Arora \(Eds\.\), AU\-KBC Research Centre, Chennai, India, pp\. 224–229\. External Links: [Link](https://aclanthology.org/2024.icon-1.25/) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p3.1)\.
- M\. C\. Frank and N\. D\. Goodman \(2012\) Predicting pragmatic reasoning in language games\. Science 336\(6084\), pp\. 998–998\. External Links: [Document](https://dx.doi.org/10.1126/science.1218633), [Link](https://www.science.org/doi/abs/10.1126/science.1218633), https://www\.science\.org/doi/pdf/10\.1126/science\.1218633 Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p2.1), [§2](https://arxiv.org/html/2606.18624#S2.p1.1)\.
- M\. Franke \(2009\) Signal to act: game theory in pragmatics\. University of Amsterdam\. External Links: [Link](https://eprints.illc.uva.nl/id/eprint/2081/) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p2.1)\.
- D\. Fried, J\. Andreas, and D\. Klein \(2018\) Unified Pragmatic Models for Generating and Following Instructions\. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\), M\. Walker, H\. Ji, and A\. Stent \(Eds\.\), New Orleans, Louisiana, pp\. 1951–1963\. External Links: [Link](https://aclanthology.org/N18-1177/), [Document](https://dx.doi.org/10.18653/v1/N18-1177) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [§3\.2](https://arxiv.org/html/2606.18624#S3.SS2.SSS0.Px1.p1.1)\.
- D\. Fried, N\. Tomlin, J\. Hu, R\. Patel, and A\. Nematzadeh \(2023\) Pragmatics in Language Grounding: Phenomena, Tasks, and Modeling Approaches\. In Findings of the Association for Computational Linguistics 2023, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 12619–12640\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.840/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.840) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p1.1), [§1](https://arxiv.org/html/2606.18624#S1.p3.1), [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [Limitations](https://arxiv.org/html/2606.18624#Sx1.p2.1)\.
- N\. D\. Goodman and M\. C\. Frank \(2016\) Pragmatic Language Interpretation as Probabilistic Inference\. Trends in Cognitive Sciences 20\(11\), pp\. 818–829 \(English\)\. External Links: ISSN 1364\-6613, 1879\-307X, [Link](https://www.cell.com/trends/cognitive-sciences/abstract/S1364-6613(16)30122-X), [Document](https://dx.doi.org/10.1016/j.tics.2016.08.005) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p2.1), [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [§3\.2](https://arxiv.org/html/2606.18624#S3.SS2.SSS0.Px1.p1.1)\.
- H\. P\. Grice \(1975\) Logic and Conversation\. In Speech Acts, \(en\)\. External Links: [Link](https://brill.com/display/book/edcoll/9789004368811/BP000003.xml), [Document](https://dx.doi.org/10.1163/9789004368811%5F003) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p1.1)\.
- A\. W\. He, D\. Fried, and S\. Welleck \(2025\) Rewarding the unlikely: lifting GRPO beyond distribution sharpening\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 25548–25560\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1298/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1298), ISBN 979\-8\-89176\-332\-6 Cited by: [§B\.6](https://arxiv.org/html/2606.18624#A2.SS6.SSS0.Px5.p1.9)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\) Measuring mathematical problem solving with the MATH dataset\. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, External Links: [Link](https://openreview.net/forum?id=7Bywt2mQsCe) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p5.1), [§5\.2](https://arxiv.org/html/2606.18624#S5.SS2.p1.1)\.
- J\. Hu, S\. Floyd, O\. Jouravlev, E\. Fedorenko, and E\. Gibson \(2023\) A fine\-grained comparison of pragmatic language understanding in humans and language models\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 4194–4213\. External Links: [Link](https://aclanthology.org/2023.acl-long.230/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.230) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p1.1), [§1](https://arxiv.org/html/2606.18624#S1.p5.1), [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [§4](https://arxiv.org/html/2606.18624#S4.SS0.SSS0.Px1.p1.1)\.
- H\. Huang, X\. Bu, H\. Zhou, Y\. Qu, J\. Liu, M\. Yang, B\. Xu, and T\. Zhao \(2025\) An Empirical Study of LLM\-as\-a\-Judge for LLM Evaluation: Fine\-tuned Judge Model is not a General Substitute for GPT\-4\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 5880–5895\. External Links: ISBN 979\-8\-89176\-256\-5, [Link](https://aclanthology.org/2025.findings-acl.306/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.306) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- S\. Kim, J\. Shin, Y\. Cho, J\. Jang, S\. Longpre, H\. Lee, S\. Yun, S\. Shin, S\. Kim, J\. Thorne, and M\. Seo \(2024\) Prometheus: inducing fine\-grained evaluation capability in language models\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=8euJaTveKw) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- E\. Kulakova and M\. S\. Nieuwland \(2016\) Pragmatic skills predict online counterfactual comprehension: evidence from the N400\. Cognitive, Affective, & Behavioral Neuroscience 16\(5\), pp\. 814–824\. External Links: [Document](https://dx.doi.org/10.3758/s13415-016-0433-4) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p2.1)\.
- J\. Lee, K\. Sakaguchi, and J\. Bak \(2025\) Self\-Training Meets Consistency: Improving LLMs’ Reasoning with Consistency\-Driven Rationale Evaluation\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 10519–10539\. External Links: ISBN 979\-8\-89176\-189\-6, [Link](https://aclanthology.org/2025.naacl-long.528/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.528) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- S\. C\. Levinson \(1983\) Pragmatics\. Cambridge University Press \(en\)\. Note: ISBN: 9780511813313 External Links: [Link](https://www.cambridge.org/highereducation/books/pragmatics/6D0011901AE9E92CBC1F5F21D7C598C3), [Document](https://dx.doi.org/10.1017/CBO9780511813313) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\) Let’s verify step by step\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=v8L0pN6EOi) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p5.1), [§5\.2](https://arxiv.org/html/2606.18624#S5.SS2.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\) TruthfulQA: measuring how models mimic human falsehoods\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 3214–3252\. External Links: [Link](https://aclanthology.org/2022.acl-long.229/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.229) Cited by: [§5\.2](https://arxiv.org/html/2606.18624#S5.SS2.p1.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\) G\-eval: NLG evaluation using GPT\-4 with better human alignment\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 2511–2522\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.153/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- B\. Ma, Y\. Li, W\. Zhou, Z\. Gong, Y\. J\. Liu, K\. Jasinskaja, A\. Friedrich, J\. Hirschberg, F\. Kreuter, and B\. Plank \(2025\) Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 8679–8696\. External Links: ISBN 979\-8\-89176\-251\-0, [Link](https://aclanthology.org/2025.acl-long.425/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.425) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p1.1), [§1](https://arxiv.org/html/2606.18624#S1.p3.1), [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [§3\.1](https://arxiv.org/html/2606.18624#S3.SS1.SSS0.Px1.p1.1), [Limitations](https://arxiv.org/html/2606.18624#Sx1.p2.1)\.
- D\. Nguyen, H\. Xiao, A\. Prasad, Z\. Khan, A\. Das, A\. Zhang, S\. Sahu, H\. Lee, E\. Stengel\-Eskin, and M\. Bansal \(2026\) AVSD: adaptive\-view self\-distillation by balancing consensus and teacher\-specific privileged signals\. External Links: 2605\.20643, [Link](https://arxiv.org/abs/2605.20643) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- OpenCompass \(2025\) AIME 2025 dataset\. Note: [https://huggingface\.co/datasets/opencompass/AIME2025](https://huggingface.co/datasets/opencompass/AIME2025) Hugging Face dataset Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p5.1), [§5\.2](https://arxiv.org/html/2606.18624#S5.SS2.p1.1)\.
- E\. Penaloza, D\. Vattikonda, N\. Gontier, A\. Lacoste, L\. Charlin, and M\. Caccia \(2026\) Privileged information distillation for language models\. External Links: 2602\.04942, [Link](https://arxiv.org/abs/2602.04942) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- Qwen Team \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p5.1)\.
- G\. Radevski, K\. Gashteovski, S\. Syed, C\. Malon, S\. Nicolas, C\. Hung, T\. Sztyler, V\. Heußer, W\. Ben Rim, M\. Enomoto, K\. Takeoka, M\. Oyamada, G\. Glavaš, and C\. Lawrence \(2025\) On Synthesizing Data for Context Attribution in Question Answering\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 16929–16950\. External Links: ISBN 979\-8\-89176\-251\-0, [Link](https://aclanthology.org/2025.acl-long.828/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.828) Cited by: [§3\.1](https://arxiv.org/html/2606.18624#S3.SS1.SSS0.Px1.p1.1)\.
- V\. Raina, A\. Liusie, and M\. Gales \(2024\) Is LLM\-as\-a\-Judge Robust? Investigating Universal Adversarial Attacks on Zero\-shot LLM Assessment\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 7499–7517\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.427/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.427) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- L\. Ruis, A\. Khan, S\. Biderman, S\. Hooker, T\. Rocktäschel, and E\. Grefenstette \(2023\) The Goldilocks of Pragmatic Understanding: Fine\-Tuning Strategy Matters for Implicature Resolution by LLMs\. Advances in Neural Information Processing Systems 36, pp\. 20827–20905 \(en\)\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/4241fec6e94221526b0a9b24828bb774-Abstract-Conference.html) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p1.1), [§1](https://arxiv.org/html/2606.18624#S1.p5.1), [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [§4](https://arxiv.org/html/2606.18624#S4.SS0.SSS0.Px1.p1.1)\.
- M\. Sap, H\. Rashkin, D\. Chen, R\. Le Bras, and Y\. Choi \(2019\) Social IQa: Commonsense Reasoning about Social Interactions\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\), Hong Kong, China, pp\. 4463–4473\. External Links: [Link](https://aclanthology.org/D19-1454/), [Document](https://dx.doi.org/10.18653/v1/D19-1454) Cited by: [§4](https://arxiv.org/html/2606.18624#S4.SS0.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. External Links: 2402\.03300, [Link](https://arxiv.org/abs/2402.03300) Cited by: [§B\.6](https://arxiv.org/html/2606.18624#A2.SS6.SSS0.Px5.p1.9), [§B\.6](https://arxiv.org/html/2606.18624#A2.SS6.p1.1), [§1](https://arxiv.org/html/2606.18624#S1.p1.1), [§1](https://arxiv.org/html/2606.18624#S1.p3.1), [§1](https://arxiv.org/html/2606.18624#S1.p4.1), [§3\.2](https://arxiv.org/html/2606.18624#S3.SS2.SSS0.Px2.p1.2)\.
- A\. Singh, J\. D\. Co\-Reyes, R\. Agarwal, A\. Anand, P\. Patil, X\. Garcia, P\. J\. Liu, J\. Harrison, J\. Lee, K\. Xu, A\. T\. Parisi, A\. Kumar, A\. A\. Alemi, A\. Rizkowsky, A\. Nova, B\. Adlam, B\. Bohnet, G\. F\. Elsayed, H\. Sedghi, I\. Mordatch, I\. Simpson, I\. Gur, J\. Snoek, J\. Pennington, J\. Hron, K\. Kenealy, K\. Swersky, K\. Mahajan, L\. Culp, L\. Xiao, M\. L\. Bileschi, N\. Constant, R\. Novak, R\. Liu, T\. Warkentin, Y\. Qian, Y\. Bansal, E\. Dyer, B\. Neyshabur, J\. Sohl\-Dickstein, and N\. Fiedel \(2024\) Beyond human data: scaling self\-training for problem\-solving with language models\. Transactions on Machine Learning Research\. External Links: [Link](https://openreview.net/forum?id=lNAyUngGFK) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- S\. Sravanthi, M\. Doshi, P\. Tankala, R\. Murthy, R\. Dabre, and P\. Bhattacharyya \(2024\) PUB: A Pragmatics Understanding Benchmark for Assessing LLMs’ Pragmatics Capabilities\. In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 12075–12097\. External Links: [Link](https://aclanthology.org/2024.findings-acl.719/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.719) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p1.1), [§1](https://arxiv.org/html/2606.18624#S1.p3.1), [§1](https://arxiv.org/html/2606.18624#S1.p5.1), [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [§4](https://arxiv.org/html/2606.18624#S4.SS0.SSS0.Px1.p1.1)\.
- S\. L\. Sravanthi, K\. Maharaj, S\. Gunnu, A\. Mishra, and P\. Bhattacharyya \(2025\) Understand the Implication: Learning to Think for Pragmatic Understanding\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 23778–23790\. External Links: ISBN 979\-8\-89176\-256\-5, [Link](https://aclanthology.org/2025.findings-acl.1218/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1218) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [Table 1](https://arxiv.org/html/2606.18624#S3.T1.12.12.5), [Table 1](https://arxiv.org/html/2606.18624#S3.T1.40.40.5), [§4](https://arxiv.org/html/2606.18624#S4.SS0.SSS0.Px2.p1.1)\.
- G\. Srivastava, Z\. Bi, M\. Lu, and X\. Wang \(2025\) DEBATE, train, evolve: self\-evolution of language model reasoning\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 32764–32810\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1666/) Cited by: [§B\.6](https://arxiv.org/html/2606.18624#A2.SS6.SSS0.Px5.p1.9)\.
- Y\. Sun, M\. Chen, T\. Zhao, R\. Xu, Z\. Zhang, and J\. Yin \(2025\) The Self\-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 6501–6512\. External Links: ISBN 979\-8\-89176\-256\-5, [Link](https://aclanthology.org/2025.findings-acl.337/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.337) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- L\. Trung, X\. Zhang, Z\. Jie, P\. Sun, X\. Jin, and H\. Li \(2024\) ReFT: reasoning with reinforced fine\-tuning\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 7601–7614\. External Links: [Link](https://aclanthology.org/2024.acl-long.410/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.410) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p1.1), [§1](https://arxiv.org/html/2606.18624#S1.p3.1), [§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- P\. Tsvilodub, K\. Gandhi, H\. Zhao, J\. Fränken, M\. Franke, and N\. D\. Goodman \(2025\) Non\-literal understanding of number words by language models\. In Proceedings of the 47th Annual Conference of the Cognitive Science Society, External Links: [Link](https://arxiv.org/abs/2502.06204) Cited by: [§3\.2](https://arxiv.org/html/2606.18624#S3.SS2.SSS0.Px1.p1.1)\.
- S\. Vaduguru, D\. Fried, and Y\. Pu \(2024\) Generating pragmatic examples to train neural program synthesizers\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=yxKZGQLzOP) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [§3\.2](https://arxiv.org/html/2606.18624#S3.SS2.SSS0.Px1.p1.1)\.
- Y\. Wang, Y\. Kordi, S\. Mishra, A\. Liu, N\. A\. Smith, D\. Khashabi, and H\. Hajishirzi \(2023\) Self\-instruct: aligning language models with self\-generated instructions\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 13484–13508\. External Links: [Link](https://aclanthology.org/2023.acl-long.754/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.754) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p2.1), [§3\.1](https://arxiv.org/html/2606.18624#S3.SS1.SSS0.Px1.p1.1)\.
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen \(2024\) MMLU\-pro: a more robust and challenging multi\-task language understanding benchmark\. In Advances in Neural Information Processing Systems, Vol\. 37\. External Links: [Document](https://dx.doi.org/10.52202/079017-3018), [Link](https://papers.nips.cc/paper_files/paper/2024/hash/ad236edc564f3e3156e1b2feafb99a24-Abstract-Datasets_and_Benchmarks_Track.html) Cited by: [§1](https://arxiv.org/html/2606.18624#S1.p5.1), [§5\.2](https://arxiv.org/html/2606.18624#S5.SS2.p1.1)\.
- S\. Wu, S\. Yang, Z\. Chen, and Q\. Su \(2024\) Rethinking Pragmatics in Large Language Models: Towards Open\-Ended Evaluation and Preference Tuning\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 22583–22599\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.1258/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1258) Cited by: [§2](https://arxiv.org/html/2606.18624#S2.p1.1), [Table 1](https://arxiv.org/html/2606.18624#S3.T1.36.36.5), [Table 1](https://arxiv.org/html/2606.18624#S3.T1.8.8.5), [§4](https://arxiv.org/html/2606.18624#S4.SS0.SSS0.Px2.p1.1)\.
- H\. Xin, Z\. Z\. Ren, J\. Song, Z\. Shao, W\. Zhao, H\. Wang, B\. Liu, L\. Zhang, X\. Lu, Q\. Du, W\. Gao, H\. Zhang, Q\. Zhu, D\. Yang, Z\. Gou, Z\. F\. Wu, F\. Luo, and C\. Ruan \(2025\)DeepSeek\-Prover\-V1\.5: harnessing proof assistant feedback for reinforcement learning and monte\-carlo tree search\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=I4YAIwrsXa)Cited by:[§B\.6](https://arxiv.org/html/2606.18624#A2.SS6.SSS0.Px5.p1.9)\.
- C\. Xu, Q\. Sun, K\. Zheng, X\. Geng, P\. Zhao, J\. Feng, C\. Tao, Q\. Lin, and D\. Jiang \(2024\)WizardLM: empowering large pre\-trained language models to follow complex instructions\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=CfXh93NDgH)Cited by:[§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- F\. Xu, H\. Yan, C\. Ma, H\. Zhao, Q\. Sun, K\. Cheng, J\. He, J\. Liu, and Z\. Wu \(2025\)Genius: A Generalizable and Purely Unsupervised Self\-Training Framework For Advanced Reasoning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 13153–13167\.External Links:ISBN 979\-8\-89176\-251\-0,[Link](https://aclanthology.org/2025.acl-long.644/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.644)Cited by:[§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- T\. Ye, L\. Dong, X\. Wu, S\. Huang, and F\. Wei \(2026\)On\-policy context distillation for language models\.External Links:2602\.12275,[Link](https://arxiv.org/abs/2602.12275)Cited by:[§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- K\. Yu, Q\. Zeng, W\. Xuan, W\. Li, J\. Wu, and R\. Voigt \(2026\)The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\),Rabat, Morocco,pp\. 192–213\.External Links:ISBN 979\-8\-89176\-380\-7,[Link](https://aclanthology.org/2026.eacl-long.9/),[Document](https://dx.doi.org/10.18653/v1/2026.eacl-long.9)Cited by:[§1](https://arxiv.org/html/2606.18624#S1.p5.1),[§2](https://arxiv.org/html/2606.18624#S2.p1.1),[§4](https://arxiv.org/html/2606.18624#S4.SS0.SSS0.Px1.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, J\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, R\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, Y\. Wu, and M\. Wang \(2025\)DAPO: An Open\-Source LLM Reinforcement Learning System at Scale\.Advances in Neural Information Processing Systems38,pp\. 113222–113244\(en\)\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/hash/a4277440d50f1f15d2cb4c14f7e0c0d2-Abstract-Conference.html)Cited by:[§B\.6](https://arxiv.org/html/2606.18624#A2.SS6.SSS0.Px1.p1.7),[§3\.2](https://arxiv.org/html/2606.18624#S3.SS2.SSS0.Px2.p1.2)\.
- W\. Yuan, R\. Y\. Pang, K\. Cho, X\. Li, S\. Sukhbaatar, J\. Xu, and J\. E\. Weston \(2024\)Self\-rewarding language models\.InInternational Conference on Machine Learning,pp\. 57905–57923\.External Links:[Link](https://proceedings.mlr.press/v235/yuan24d.html)Cited by:[§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- E\. Zelikman, G\. R\. Harik, Y\. Shao, V\. Jayasiri, N\. Haber, and N\. Goodman \(2024\)Quiet\-STaR: language models can teach themselves to think before speaking\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=oRXPiSOGH9)Cited by:[§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- E\. Zelikman, Y\. Wu, J\. Mu, and N\. Goodman \(2022\)STaR: bootstrapping reasoning with reasoning\.InAdvances in Neural Information Processing Systems,Vol\.35\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/639a9a172c044fbb64175b5fad42e9a5-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2606.18624#S2.p2.1),[§3\.1](https://arxiv.org/html/2606.18624#S3.SS1.SSS0.Px1.p1.1)\.
- R\. Zhang, J\. Guo, L\. Chen, Y\. Fan, and X\. Cheng \(2022\)A Review on Question Generation from Natural Language Text\.ACM Transactions on Information Systems40\(1\),pp\. 14:1–14:43\.External Links:[Link](https://dl.acm.org/doi/10.1145/3468889?utm_source=chatgpt.com),[Document](https://dx.doi.org/10.1145/3468889)Cited by:[§3\.1](https://arxiv.org/html/2606.18624#S3.SS1.SSS0.Px1.p1.1)\.
- A\. Zhao, Y\. Wu, T\. Wu, Q\. Xu, Y\. Yue, M\. Lin, S\. Wang, Q\. Wu, Z\. Zheng, and G\. Huang \(2025\)Absolute zero: reinforced self\-play reasoning with zero data\.InAdvances in Neural Information Processing Systems,Vol\.38\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2025/hash/9837dc00ff67d176373268ed48042d49-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- S\. Zhao, Z\. Xie, M\. Liu, J\. Huang, G\. Pang, F\. Chen, and A\. Grover \(2026\)Self\-distilled reasoner: on\-policy self\-distillation for large language models\.External Links:2601\.18734,[Link](https://arxiv.org/abs/2601.18734)Cited by:[§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.
- A\. A\. Zheng, J\. J\. Li, and D\. I\. Beaver \(2026\)Strategic dialogue assessment: the crooked path to innocence\.Dialogue & Discourse17,pp\. 1–53\.External Links:[Link](https://aclanthology.org/2026.dnd-17.1/),[Document](https://dx.doi.org/10.5210/dad.2026.101)Cited by:[§C\.2](https://arxiv.org/html/2606.18624#A3.SS2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html)Cited by:[§2](https://arxiv.org/html/2606.18624#S2.p2.1)\.

## Appendix A Data Construction

We construct the primary pragmatic QA data in two stages\. First, we generate short\-answer pragmatic QA instances from domain seeds and pragmatic section descriptions\. Second, we audit the generated instances with a binary self\-judging prompt and retain high\-quality examples for training\. The resulting filtered set is used as the primary problem data for both SFT target generation and GRPO training\.

### A\.1 Pragmatic Sections

For data generation, each example is conditioned on one pragmatic section\. A section description is a short natural\-language definition of the type of pragmatic inference the generated question should require\. Definitions of the six sections are given in [Table 6](https://arxiv.org/html/2606.18624#A1.T6)\.

Table 6: Pragmatic sections used for self\-generated QA data\. Each section description is inserted into the QA generation prompt to guide the model toward examples requiring the corresponding type of pragmatic interpretation\.

### A\.2 QA Generation Prompt

To construct the primary Pragmatic QA data, we prompt the model to generate exactly one QA item for a given domain and pragmatic section\. Each generated item contains a concrete context, a question, and a short answer\. The prompt explicitly requires that the question be impossible to answer without pragmatic interpretation\. We use a small pool of manually inspected few\-shot examples to stabilize the format\. Generated items are rejected if they cannot be parsed, have missing fields, or duplicate an earlier item under a normalized string signature\.

QA Generation Prompt

System\. You are a pragmatic QA data generator\. Generate exactly one item for the target section below\.

Target section: \{SECTION\_NAME\}

Section description: \{SECTION\_DESCRIPTION\}

You may reason before the final output if needed\. The final output must start with a line containing exactly FINAL\_QA:

After FINAL\_QA:, return plain text only with this exact format:

Question must be impossible to answer without pragmatic interpretation\. Answer must be short and clear\.

FINAL\_QA:

Content: \.\.\.

Question: \.\.\.

Answer: \.\.\.

No JSON and no extra commentary after FINAL\_QA:\.

User\. Domain keyword: \{DOMAIN\}

Item: \{ITEM\_INDEX\}/\{ITEMS\_PER\_DOMAIN\}

Field definitions:

- content: a concrete scenario/context only, with enough detail for pragmatic inference\.
- question: one explicit question about the scenario\.
- answer: a minimal, short, concise, single correct answer to the question without any explanation\.

Section examples \(style reference\):

\{FEW\_SHOT\_EXAMPLES\}

Output plain text only with this exact format:

FINAL\_QA:

Content: \.\.\.

Question: \.\.\.

Answer: \.\.\.

No JSON, no markdown, no extra commentary after FINAL\_QA:\.
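
The rejection rules described above (unparseable output, missing fields, duplicate signature) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the regex-based parsing, and the exact normalization (lowercasing, stripping punctuation, collapsing whitespace) are assumptions, since the text only says items are deduplicated "under a normalized string signature".

```python
import re
from typing import Optional

# Field labels follow the prompt's output format; the parsing and
# normalization rules below are assumptions for illustration.
FIELDS = ("Content", "Question", "Answer")


def parse_final_qa(output: str) -> Optional[dict]:
    """Return the three fields after the last FINAL_QA: marker, or None."""
    marker = "FINAL_QA:"
    if marker not in output:
        return None
    tail = output.rsplit(marker, 1)[1]
    item = {}
    for i, name in enumerate(FIELDS):
        # Capture everything up to the next field label (or end of text).
        nxt = FIELDS[i + 1] if i + 1 < len(FIELDS) else None
        stop = rf"(?=\n\s*{nxt}:)" if nxt else r"\Z"
        m = re.search(rf"{name}:\s*(.*?)\s*{stop}", tail, flags=re.S)
        if not m or not m.group(1):
            return None  # missing or empty field
        item[name.lower()] = m.group(1)
    return item


def signature(item: dict) -> str:
    """Normalized signature: lowercase, drop punctuation, collapse whitespace."""
    text = " ".join(item[k] for k in ("content", "question", "answer"))
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def collect(outputs: list) -> list:
    """Keep parseable, complete, non-duplicate items in generation order."""
    seen, kept = set(), []
    for out in outputs:
        item = parse_final_qa(out)
        if item is None:
            continue
        sig = signature(item)
        if sig in seen:
            continue
        seen.add(sig)
        kept.append(item)
    return kept
```

Parsing from the last `FINAL_QA:` marker lets the model reason freely before the final output, as the prompt permits.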

### A\.3 Automatic Audit and Filtering

We audit each generated Pragmatic QA instance with a binary quality\-judgment prompt\. The auditor is given calibration examples from the same section, followed by the generated item to judge\. An item is retained only if it is well\-formed, unambiguous, answerable, and requires the intended type of pragmatic interpretation\.

Automatic Audit and Filtering Prompt

System\. You are a dataset auditor for Pragmatic QA\. Judge whether the item is high\-quality for its SECTION\. Answer yes for high\-quality, no for low\-quality\.

User\. Is this pragmatic QA example high\-quality?

Criteria\.

- Pragmatic dependency: the gold answer requires pragmatic interpretation, not just literal reading\.
- Question correctness: the question itself is not incorrect or ambiguous\.
- Gold answer: the gold answer is correct and uniquely best\-supported by the given context\.

Few\-shot examples for this same SECTION: *same\-section accepted and rejected calibration examples*\.

Example to judge: SECTION: \{SECTION\}; SCENARIO: \{CONTENT\}; QUESTION: \{QUESTION\}; GOLD ANSWER: \{GOLD\_ANSWER\}\.

Answer just yes or no with no other output\. Final answer:

We apply the audit judge with the first\-token margin $m$ \(§[3\.1](https://arxiv.org/html/2606.18624#S3.SS1.SSS0.Px1)\) and discard the bottom $50\%$ of generated items by $m$, while preserving balance across pragmatic sections\.
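
One way to discard the bottom half by margin while keeping sections balanced is to rank and cut within each section. The sketch below assumes exactly that. The dictionary schema (`section`, `margin`) and the function name are hypothetical, and the text does not say whether the authors filter per section or use another balancing scheme.

```python
from collections import defaultdict


def keep_top_half_per_section(items):
    """items: list of dicts with 'section' and 'margin' keys (hypothetical schema)."""
    by_section = defaultdict(list)
    for it in items:
        by_section[it["section"]].append(it)
    kept = []
    for section, group in by_section.items():
        ranked = sorted(group, key=lambda it: it["margin"], reverse=True)
        kept.extend(ranked[: len(ranked) // 2])  # drop the bottom 50% by margin
    return kept
```

Under this assumption, 1,000 generated items per section would leave 500 per section, which matches the counts reported in Appendix B.5.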

## Appendix B Counterfactual Answer Generation and Training

We train models in two sequential stages: supervised fine\-tuning \(SFT\) followed by GRPO\. Before SFT, we generate target answers for the filtered primary problem data\. The counterfactual condition uses a privileged counterfactual reasoning script during answer generation, while the student model is later trained without this script in the input\. The non\-counterfactual variant uses the same pipeline but replaces the counterfactual script with a lighter pragmatic QA prompt\.

### B\.1 Answer Generation Prompts

For the counterfactual condition, we use a pragmatic QA prompt that explicitly instructs the model to interpret the observed utterance as a communicative choice among plausible alternatives\. The model is asked to reason about what the speaker could have said under a different intention and to use this contrast to infer the intended meaning\.

Counterfactual Answer Generation Prompt

System\. You are solving a pragmatic question\-answering task\. Your goal is to choose the best answer by inferring the speaker’s intended meaning in context\.

Do not rely on the literal meaning of the utterance alone\. Instead, interpret the utterance as a communicative choice made by a roughly rational and informative speaker\. Think about why this speaker chose this utterance, in this context, given what the speaker likely knows\.

When answering a question, use the following reasoning principles:

1. Identify the literal meaning of the utterance\.
2. Use the context and shared background to determine what the speaker is likely trying to communicate\.
3. Consider why the speaker chose this utterance instead of other plausible alternatives\.
4. Assume the speaker is trying to provide relevant information in context, but may not say more than is needed\.
5. Use the speaker’s likely knowledge and the shared context to infer what the listener is expected to understand\.
6. Choose the answer that best explains the utterance as a rational, informative, and contextually relevant choice\.

When possible, justify your interpretation contrastively until you reach one clear interpretation:

- state one more direct, stronger, or more literal alternative the speaker could have said,
- explain what that alternative would have implied,
- then explain why the actual utterance suggests a different intended meaning\.

Reason in the form: “If the speaker had intended X, they would likely have said Y\. Because they said Z instead, the intended meaning is more likely W\.”

Guidelines:

- Prefer the answer that matches the speaker’s intended meaning, not just the surface wording\.
- Use context, shared background, speaker knowledge, and plausible alternatives to interpret the utterance\.
- Prefer interpretations that best explain the speaker’s choice of wording\.
- Do not infer more than the context, shared background, and the speaker’s choice support\.

For the non\-counterfactual baseline, we use a lighter pragmatic QA prompt that asks the model to consider pragmatic meaning, but does not instruct it to explicitly contrast the observed utterance with alternative utterances\. This isolates the effect of the counterfactual reasoning script from generic pragmatic prompting\.

Non\-counterfactual Answer Generation Prompt

System\. You are solving a pragmatic question\-answering task\. Answer the question by considering the pragmatic meaning of the utterance in its context\. Output only the final answer text\.

### B\.2 Correctness Judge Prompt

Judge Prompt

System\. You are a strict QA evaluator\. Respond with exactly one word: either ‘yes’ or ‘no’\. Do not emit any other text, punctuation, or explanation\.

User\. Task: QA answer grading\.

CONTEXT: \{CONTENT\}; QUESTION: \{QUESTION\}; REFERENCE ANSWER: \{REFERENCE\}; CANDIDATE ANSWER: \{CANDIDATE\}\.

Is the candidate answer semantically correct given the context, question, and reference answer? Answer with a single word: yes or no\.

In all cases, the judge is the untuned instruct backbone matched in size to the policy under training, keeping the pipeline self\-contained and free of external distillation\.

We read the first\-token log\-probabilities for `yes` and `no` from the judge, convert them to probabilities, and compute the margin

$$m(x,a) = p(\texttt{yes}\mid x,a) - p(\texttt{no}\mid x,a),$$

where $x$ is the problem \(context, question, and reference answer\) and $a$ is the candidate answer\. We accept a candidate if $m(x,a) > 0.8$\. The choice of threshold is justified by the human\-agreement study in §[B\.3](https://arxiv.org/html/2606.18624#A2.SS3)\.
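
As a small illustration of the margin rule, the sketch below turns two first-token log-probabilities into a margin and an accept decision. The function names are hypothetical. It also assumes the probabilities are plain exponentials of the log-probabilities, not renormalized over the two tokens, which the text does not specify.

```python
import math


def judge_margin(logp_yes: float, logp_no: float) -> float:
    """m(x, a) = p(yes | x, a) - p(no | x, a) from first-token log-probabilities."""
    # Assumption: probabilities are exp(logprob) over the full vocabulary,
    # not renormalized over {yes, no}.
    return math.exp(logp_yes) - math.exp(logp_no)


def accept(logp_yes: float, logp_no: float, tau: float = 0.8) -> bool:
    """Accept the candidate answer when the margin exceeds the threshold tau."""
    return judge_margin(logp_yes, logp_no) > tau
```

For example, a judge that puts 0.95 on `yes` and 0.03 on `no` yields a margin of 0.92 and is accepted, whereas 0.85 versus 0.10 yields 0.75 and is rejected even though `yes` is the more likely token.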

### B\.3 Margin Calibration

To verify that the judge margin reflects meaningful correctness confidence, we conduct a small human\-agreement calibration study\. We sample 100 examples and two authors independently label whether the candidate answer is semantically correct given the context, question, and reference answer\. The gold label is the consensus among non\-skip reviewers\. Examples that either reviewer marked as *skip* are assigned a label of *incorrect* rather than excluded, since they correspond to outputs the judge should not accept\. We then compare the margin\-thresholded judge decisions against the human gold labels\.

Table 7: Agreement between self\-judge margin thresholds and human labels on the manually reviewed calibration subset of 100 examples\. Examples that either reviewer marked as *skip* are mapped to *incorrect* in the gold label rather than excluded, so all 100 examples are retained\. Precision, recall, F1, and accuracy are computed by treating the margin\-thresholded judge decision as the prediction and the human label as gold\.

The calibration results show that the judge margin is informative: increasing the threshold generally makes the judge more conservative, reducing recall while maintaining comparable precision\. The best human\-agreement F1 is obtained at $m > 0.7$ \(0\.800\), while $m > 0.8$ remains a high\-agreement operating point with precision 0\.780, recall 0\.812, F1 0\.796, and accuracy 0\.800\. Very strict thresholds such as $m > 0.99$ substantially reduce recall, suggesting that overly conservative filtering discards many human\-acceptable responses\. We adopt $\tau = 0.8$ as a conservative midpoint that preserves judge precision\. The threshold is fixed before running downstream experiments and is not tuned against pragmatic benchmark performance\. The same value is used for both Qwen3\-8B and Qwen3\-14B\.
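
The agreement metrics can be computed as in the sketch below, which treats `margin > tau` as the prediction and the human label as gold. The function name and input layout are hypothetical; `gold` is a list of booleans in which *skip* examples have already been mapped to `False`.

```python
def agreement(margins, gold, tau):
    """Precision, recall, F1, accuracy of the rule `margin > tau` against human labels."""
    pred = [m > tau for m in margins]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    tn = len(gold) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold)
    return precision, recall, f1, accuracy
```

As a consistency check on the reported operating point, the harmonic mean of precision 0.780 and recall 0.812 is about 0.796, which agrees with the stated F1.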

### B\.4 Supervised Fine\-Tuning

#### SFT Pregeneration\.

Let $x$ denote an input instance from the filtered primary problem data, and let $s \in \mathcal{S}$ denote its pragmatic section label\. We write $d(s)$ for the natural\-language description of section $s$, and $p_{\mathrm{cf}}$ for the counterfactual pragmatic reasoning script used during response pre\-generation\. Let $\pi_{\theta}$ denote the base policy used to generate candidate SFT targets\. For each retained problem, we construct an augmented teacher\-side prompt

$$\tilde{x} = \mathrm{Aug}(x,s) = [p_{\mathrm{cf}};\, d(s);\, x],$$

which exposes the teacher to both the section description and an explicit counterfactual reasoning scaffold\. Given $\tilde{x}$, the model samples a response

$$y \sim \pi_{\theta}(\cdot \mid \tilde{x}),$$

where $y=(r,a)$ consists of a reasoning trace $r$ and a final answer $a$\. We then apply a binary self\-judge with the margin method \(§[B\.2](https://arxiv.org/html/2606.18624#A2.SS2)\),

$$J(\tilde{x},a) \in \{0,1\},$$

which returns 1 only if the response is judged pragmatically correct with respect to the context, question, and reference answer\. This yields the accepted set

$$\mathcal{D}_{\mathrm{accept}} = \{(x,s,y) \mid J(\tilde{x},a)=1\}.$$

The final SFT dataset removes the privileged augmentation from the student input\. Although the target response $y$ is generated under $\tilde{x}$, the student is trained only on the original problem $x$ paired with the accepted output:

$$\mathcal{D}_{\mathrm{SFT}} = \{(x,y) \mid (x,s,y) \in \mathcal{D}_{\mathrm{accept}}\}.$$

This asymmetry between teacher\-side generation and student\-side training is central to our design: the counterfactual reasoning script is used to construct high\-quality reasoning traces, but the student must learn to produce such traces without seeing the script at inference time\. We train with the standard causal language modeling objective on $\mathcal{D}_{\mathrm{SFT}}$, masking prompt tokens and applying loss only to the assistant response:

$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{(x,y)\in\mathcal{D}_{\mathrm{SFT}}} \sum_{t=1}^{|y|} \log \pi_{\theta}(y_{t} \mid x, y_{<t}).$$
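
The teacher/student asymmetry can be sketched as a small data-construction loop. Everything here is illustrative: the `Problem` type, the callables `describe`, `generate`, and `margin`, the prompt concatenation, and the serialization of the target (think tags plus a boxed answer) are assumptions, not the authors' implementation.

```python
from typing import Callable, NamedTuple


class Problem(NamedTuple):
    context: str
    question: str
    reference: str
    section: str


def build_sft_dataset(
    problems: list,
    p_cf: str,                                # counterfactual reasoning script
    describe: Callable[[str], str],           # d(s): section description
    generate: Callable[[str], tuple],         # augmented prompt -> (trace r, answer a)
    margin: Callable[[Problem, str], float],  # judge margin m(x, a)
    tau: float = 0.8,
) -> list:
    dataset = []
    for x in problems:
        student_prompt = f"{x.context}\n{x.question}"
        # Teacher-side prompt x~ = [p_cf; d(s); x]
        teacher_prompt = "\n\n".join([p_cf, describe(x.section), student_prompt])
        trace, answer = generate(teacher_prompt)
        if margin(x, answer) > tau:  # J = 1: keep only judged-correct responses
            # The student sees only x: script and section description are dropped.
            target = f"<think>{trace}</think>\n\\boxed{{{answer}}}"
            dataset.append((student_prompt, target))
    return dataset
```

The returned pairs correspond to $\mathcal{D}_{\mathrm{SFT}}$: targets generated under the augmented prompt, paired with the unaugmented problem.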

#### SFT Hyperparameters\.

We use full\-parameter fine\-tuning with maximum sequence length 8192, bfloat16 precision, AdamW, cosine learning\-rate schedule, learning rate $5\times 10^{-7}$, two epochs, per\-device batch size 1, gradient accumulation 4, warmup ratio 0\.03, gradient clipping 1\.0, and gradient checkpointing\. Distributed runs use FSDP full\-shard training with decoder\-layer auto\-wrapping and full\-state\-dict checkpointing\.

### B\.5 Dataset Size and Training Budget

We first sample 1,000 domain seeds and pair each seed with each of the six pragmatic sections, yielding 6,000 seed–section generation prompts\. All 6,000 generations are parsed into valid short\-answer QA items\. The self\-filtering stage retains 3,000 primary problem instances, corresponding to 500 examples per pragmatic section\. These 3,000 filtered problems are used as the GRPO training prompts\.

For SFT, we generate target responses for the filtered primary problems using the counterfactual reasoning script and then apply the answer\-quality judge\. During Qwen3\-14B training, this produces 2,816 accepted SFT targets, of which 2,759 are used for SFT training and 57 are held out as a synthetic validation split\.

Table 8: Aggregate data counts after answer generation and filtering\. The accepted SFT targets are generated from the filtered primary problem data\.

### B\.6 GRPO

We initialize the policy from the SFT checkpoint $\pi_{\mathrm{SFT}}$ and optimize it with GRPO \(Shao et al\., [2024](https://arxiv.org/html/2606.18624#bib.bib8)\) on the primary problem data\. Below we first specify the algorithmic setup \(reward, optimization objective\) and then the implementation details \(hyperparameters, infrastructure\)\.

#### Easy\-Prompt Filtering\.

Following DAPO \(Yu et al\., [2025](https://arxiv.org/html/2606.18624#bib.bib29)\), we apply an additional difficulty\-based pass over the primary problem data before GRPO training\. For each prompt $x$, we sample $G=8$ rollouts from the SFT checkpoint $\pi_{\mathrm{SFT}}$ and score each rollout with the same correctness judge and margin $m$ \(§[B\.2](https://arxiv.org/html/2606.18624#A2.SS2)\) used during GRPO \(threshold $\tau=0.8$\)\. A prompt is marked *easy* if every one of its $G$ rollouts passes the judge, and is discarded\. Such prompts yield zero\-variance advantage estimates under group normalization and therefore contribute no gradient signal to the policy update\. Filtering thus concentrates training on prompts at the frontier of $\pi_{\mathrm{SFT}}$’s capability\.
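
The easy-prompt rule amounts to an all-pass check over the sampled group. In this sketch the callables `sample_rollouts` (returning `G` candidate answers from the SFT checkpoint) and `margin` (the judge margin) are hypothetical stand-ins for the actual rollout and judge services.

```python
def drop_easy_prompts(prompts, sample_rollouts, margin, G=8, tau=0.8):
    """Keep prompts for which at least one of G rollouts fails the judge."""
    kept = []
    for x in prompts:
        answers = sample_rollouts(x, G)  # G candidate answers from the SFT checkpoint
        if all(margin(x, a) > tau for a in answers):
            continue  # every rollout passes: marked easy and discarded
        kept.append(x)
    return kept
```

Like the description above, this sketch removes only all-pass prompts; prompts on which every rollout fails are kept.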

#### Reward Design\.

*Format reward\.* $R_{\mathrm{fmt}}(y)\in\{0,\,0.5,\,1\}$ is a dense shaping signal that stabilizes early training by encouraging the model to maintain the structured output it acquired during SFT\. A response receives $0.5$ for containing exactly one well\-formed pair of `<think>`…`</think>` tags, and an additional $0.5$ for a valid `\boxed{}` answer in the post\-thinking segment\.

For each training prompt $x$, the policy $\pi_{\theta}$ samples a group of $G=8$ candidate responses $\{y_{1},\ldots,y_{G}\}$, where each $y_{i}$ consists of a reasoning trace and a boxed final answer $a_{i}$\. Each response is scored by a composite reward

$$R(x,y_{i}) = w_{\mathrm{fmt}}\,R_{\mathrm{fmt}}(y_{i}) + w_{\mathrm{ans}}\,R_{\mathrm{ans}}(x,y_{i}),$$

combining a format\-compliance term with a pragmatic\-correctness term\.

*Correctness reward\.* $R_{\mathrm{ans}}(x,y)\in\{0,1\}$ reuses the judge margin $m(x,a)$ from §[B\.2](https://arxiv.org/html/2606.18624#A2.SS2)\. If `\boxed{}` extraction fails, we set $R_{\mathrm{ans}}(x,y)=0$ without querying the judge\. Otherwise we extract the candidate answer $a$ from the rollout and assign $R_{\mathrm{ans}}(x,y)=1$ if $m(x,a)>\tau$ and $0$ otherwise, with $\tau=0.8$ fixed by the calibration in Appendix [B\.3](https://arxiv.org/html/2606.18624#A2.SS3)\.

#### Reward Scaling\.

We set $w_{\mathrm{fmt}}=1$ and $w_{\mathrm{ans}}=2$ so that correctness strictly dominates format\. This asymmetric scaling prevents the policy from trading off pragmatic correctness for the denser, near\-saturated format signal during early optimization\. We use the smallest integer weighting that establishes this strict ordering rather than tuning the ratio against benchmark performance\.
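
The two reward terms and their weighting can be sketched as follows. The regular expression, the definition of the post-thinking segment (text after the last `</think>`), and the function names are assumptions for illustration. In particular, this sketch does not handle nested braces inside `\boxed{}`, and `margin_fn` is a hypothetical callable that queries the judge for a candidate answer.

```python
import re

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")  # assumption: no nested braces in answers


def post_thinking(y: str) -> str:
    """Text after the last </think> tag, or the whole response if there is none."""
    return y.rsplit(THINK_CLOSE, 1)[-1]


def extract_answer(y: str):
    """Return the boxed answer from the post-thinking segment, or None."""
    m = BOXED.search(post_thinking(y))
    if not m:
        return None
    return m.group(1).strip() or None


def format_reward(y: str) -> float:
    """0.5 for exactly one ordered think pair, plus 0.5 for a valid boxed answer."""
    one_pair = (
        y.count(THINK_OPEN) == 1
        and y.count(THINK_CLOSE) == 1
        and y.index(THINK_OPEN) < y.index(THINK_CLOSE)
    )
    return 0.5 * one_pair + 0.5 * (extract_answer(y) is not None)


def total_reward(y: str, margin_fn, tau=0.8, w_fmt=1.0, w_ans=2.0) -> float:
    """Composite reward R = w_fmt * R_fmt + w_ans * R_ans."""
    a = extract_answer(y)
    # The judge is queried only when boxed-answer extraction succeeds.
    r_ans = 1.0 if a is not None and margin_fn(a) > tau else 0.0
    return w_fmt * format_reward(y) + w_ans * r_ans
```

With these weights, any response judged correct scores at least 2.5 (it necessarily has a boxed answer), while an incorrect response scores at most 1.0, so the ordering between correct and incorrect responses never depends on the think tags.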

#### Optimization Objective\.

Given the per\-response rewards, GRPO computes group\-normalized advantages

$$\hat{A}_{i} = \frac{R(x,y_{i}) - \mathrm{mean}\!\left(\{R(x,y_{j})\}_{j=1}^{G}\right)}{\mathrm{std}\!\left(\{R(x,y_{j})\}_{j=1}^{G}\right)},$$

and updates the policy with a clipped PPO\-style surrogate objective without a learned value function\. Because the learning signal comes from *relative* quality differences within each group, training is robust to the absolute scale of the rewards\.
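
A minimal sketch of the group normalization for one prompt's rollouts is given below. Two details are assumptions not stated in the text: the use of the population standard deviation, and a small `eps` in the denominator so that a zero-variance group yields zero advantages instead of a division error.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one prompt's G rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Multiplying every reward in a group by a positive constant leaves the advantages essentially unchanged (exactly so when `eps` is zero), which is the scale robustness noted above. A group in which all rollouts receive the same reward gets all-zero advantages, which is why all-pass prompts are filtered out before training.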

#### GRPO Hyperparameters\.

We use full\-parameter optimization with bfloat16 precision, AdamW, and a cosine learning\-rate schedule with peak learning rate $4\times 10^{-6}$ and warmup ratio 0\.1\. The train batch size and PPO mini\-batch size are both 128, with per\-GPU micro\-batch size 2\. Maximum prompt and response lengths are 512 and 1,536 tokens, respectively; overlong prompts are filtered out and the remainder are left\-truncated\. Following Srivastava et al\. \([2025](https://arxiv.org/html/2606.18624#bib.bib55)\); Xin et al\. \([2025](https://arxiv.org/html/2606.18624#bib.bib5)\); Shao et al\. \([2024](https://arxiv.org/html/2606.18624#bib.bib8)\); He et al\. \([2025](https://arxiv.org/html/2606.18624#bib.bib61)\), we regularize the policy toward $\pi_{\mathrm{SFT}}$ with a low\-variance KL loss applied directly to the objective with coefficient 0\.02, which discourages drift from the counterfactual reasoning behaviors acquired during SFT\. We train for 4 epochs over the primary problem data\.

#### GRPO Infrastructure\.

Each GRPO run uses 5 NVIDIA H200 GPUs: 4 GPUs host the policy and serve rollouts in\-process via vLLM \(one GPU per node, FSDP2 full\-shard across the 4 nodes\), and the remaining GPU runs a separate vLLM endpoint hosting the frozen instruct model as the judge\. Rollouts are sampled at temperature 1\.0, top\-$p$ 1\.0, and top\-$k$ disabled, with $G=8$ samples per prompt\. End\-to\-end wall\-clock for the 4\-epoch run on the primary problem data is approximately 2\.5 hours on this configuration\.

### B\.7 GRPO Checkpoint Selection

To avoid last\-checkpoint or hand\-picked bias, we report the checkpoint chosen by a fixed selection protocol\. We save a checkpoint at every optimizer step during GRPO and select among these checkpoints as follows\.

#### Held\-out selection set\.

We construct the selection set from a fresh round of the self\-generation pipeline described in [Appendix A](https://arxiv.org/html/2606.18624#A1), run independently of the round used to produce the GRPO training data\. From this fresh pool we draw a seeded, stratified sample of 100 examples per pragmatic section\. To guarantee disjointness from training, any selection row whose $(\text{context},\text{question})$ pair appears in the GRPO training data is discarded\. The selection set is fixed with seed 42 across all checkpoints of all runs of a given model, so all checkpoints of a model are graded on exactly the same rows\.
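
The selection-set construction can be sketched as a seeded per-section sample followed by a disjointness filter. The row schema (`section`, `context`, `question`) and the function name are hypothetical, and the sketch assumes sampling happens before the overlap check, as the order of the description suggests.

```python
import random
from collections import defaultdict


def build_selection_set(fresh_pool, train_rows, per_section=100, seed=42):
    """Seeded, stratified sample from a fresh pool, disjoint from training rows."""
    rng = random.Random(seed)
    by_section = defaultdict(list)
    for row in fresh_pool:
        by_section[row["section"]].append(row)
    train_keys = {(r["context"], r["question"]) for r in train_rows}
    selected = []
    for section in sorted(by_section):  # fixed iteration order for reproducibility
        rows = by_section[section]
        sample = rng.sample(rows, min(per_section, len(rows)))
        # Discard any sampled row whose (context, question) pair appears in training.
        selected.extend(
            r for r in sample if (r["context"], r["question"]) not in train_keys
        )
    return selected
```

Because the generator is seeded and sections are visited in sorted order, repeated calls on the same pool return the same rows, so every checkpoint is graded on an identical set.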

#### Scoring\.

Each checkpoint is scored on the selection set using the same correctness judge and margin $m(x,a)$ \(§[B\.2](https://arxiv.org/html/2606.18624#A2.SS2)\) that define the GRPO training reward, with a row counted correct if $m(x,a)>0.8$\. This means selection accuracy is the GRPO training reward itself, computed on held\-out data, rather than a separate evaluation metric\. For decoding, we generate one rollout per row with temperature 0 and `max_tokens=2048`, scoring each row independently\.

#### Selection Rule\.

We pick the checkpoint with the highest selection accuracy\. Exact ties are broken in favor of the later step, on the principle that the later step has absorbed strictly more of the training signal and is therefore the more conservative choice to promote\.
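
The selection rule reduces to a single comparison on (accuracy, step) pairs. The function name and the dictionary layout below are hypothetical.

```python
def select_checkpoint(scores):
    """scores: dict mapping optimizer step -> selection accuracy. Returns the chosen step."""
    # Highest accuracy wins; exact ties go to the later step.
    return max(scores, key=lambda step: (scores[step], step))
```

For instance, with accuracies `{10: 0.71, 20: 0.74, 30: 0.74, 40: 0.70}` the rule picks step 30, the later of the two tied checkpoints.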

#### Independence from Test Benchmarks\.

The selection set consists of self\-generated pragmatic QA filtered by the same base\-model judge used throughout our pipeline \(matched in size to the trained model\), and shares no items with PragMega, Ludwig, MetoQA, or AltPrag\. Selection therefore cannot leak signal from these benchmarks, so the test numbers in Table [1](https://arxiv.org/html/2606.18624#S3.T1) measure generalization beyond the selection pool\.

## Appendix C Benchmarks and Results

### C\.1 Benchmark Examples

[Table 9](https://arxiv.org/html/2606.18624#A3.T9) shows two examples from each benchmark\.

Table 9: Representative examples from the four pragmatic evaluation benchmarks used in our experiments\. The examples illustrate the different forms of pragmatic interpretation tested by each benchmark: multiple\-choice pragmatic QA in PragMega, binary implicature resolution in Ludwig, metonymic reference resolution in MetoQA, and open\-ended implied\-meaning recovery in AltPrag\.

### C\.2 Exploratory Evaluation on Non\-Cooperative Pragmatics

To examine whether counterfactual reasoning transfers to non\-cooperative and adversarial settings, we evaluate PragReST using the Strategic Dialogue Assessment \(SDA\) framework introduced by Zheng et al\. \([2026](https://arxiv.org/html/2606.18624#bib.bib60)\)\. SDA evaluates courtroom cross\-examinations as strategic exchanges, measuring whether a model can track how each response affects the speaker’s position in the dialogue\. Zheng et al\. \([2026](https://arxiv.org/html/2606.18624#bib.bib60)\) find that LLMs can rely on surface\-level discourse cues when judging adversarial dialogue, sometimes treating damage control strategies such as hedging or deflection as neutral or positive rather than recognizing them as attempts to mitigate a harmful commitment\. This setting therefore provides a complementary test of whether PragReST helps models reason beyond the cooperative surface form of an utterance\.

We focus on three primary SDA metrics: BaT \(Benefit at Turn\), which measures alignment with human judgments of strategically beneficial moves; PaT \(Penalty at Turn\), which measures alignment with human judgments of strategically detrimental moves; and NRBaT \(Normalized Relative Benefit at Turn\), which captures the cumulative balance between benefits and penalties over the dialogue\.

Table 10: Performance on the SDA framework\. Values report mean Spearman’s ρ correlations with human judgments across five seeds \(using temperature sampling at 0\.6\), with standard deviations shown after ±\. Higher values indicate stronger alignment with human judgments\.

#### Quantitative Results\.

As shown in Table [10](https://arxiv.org/html/2606.18624#A3.T10), the most consistent change appears on PaT\. PragReST increases PaT for both Qwen3\-8B \(from −0\.001 to 0\.013\) and Qwen3\-14B \(from 0\.047 to 0\.062\)\. These gains are modest relative to variation across seeds, so we interpret them cautiously\. Still, the consistent direction of the change suggests that counterfactual training may improve the model’s ability to recognize when a response imposes a strategic cost on the speaker, rather than treating locally cooperative answers as neutral or beneficial\.

The remaining SDA metrics show a more mixed pattern\. For Qwen3\-8B, PragReST improves NRBaT from 0\.149 to 0\.179, but decreases BaT from 0\.132 to 0\.051\. For Qwen3\-14B, PragReST improves BaT and PaT, while NRBaT remains essentially unchanged, moving from 0\.075 to 0\.073\. We therefore interpret the SDA results as evidence for a targeted improvement in recognizing strategic penalties, rather than a uniform improvement across all dimensions of adversarial dialogue assessment\.
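The aggregation reported in Table 10 \(a per\-seed Spearman correlation with human judgments, summarized as mean and standard deviation across seeds\) can be sketched as follows\. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, tied values receive average ranks, and the sample standard deviation is used, none of which the text above specifies\.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


def summarize_across_seeds(
    per_seed_scores: Sequence[Sequence[float]], human: Sequence[float]
) -> Tuple[float, float]:
    """Mean and sample standard deviation of rho over seeds (needs >= 2 seeds)."""
    rhos = [spearman(scores, human) for scores in per_seed_scores]
    return mean(rhos), stdev(rhos)
```

With correlations as small as those in Table 10, the standard deviation returned here is what indicates whether a difference between two models exceeds seed\-to\-seed variation\.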

#### Qualitative analysis\.

To better understand the PaT gains, we inspect turns where PragReST agrees with the human penalty judgment but the instruct backbone does not\. Most recovered cases involve a change in how the model interprets the witness’s answer: PragReST is more likely to recognize that the witness has conceded information that helps the opposing side\. In SDA terms, this means identifying a response as strategically harmful even when it is locally clear, truthful, and relevant\. Thus, the PaT gains suggest that PragReST is not simply rewarding answers for being clear or responsive\. Instead, it more often recognizes when an apparently cooperative answer gives the opposing side useful information\.

For example, when a witness is asked whether the defendant “voluntarily spoke with you in a tape\-recorded interview without the presence of counsel” and answers “Yes,” the instruct model recognizes that the response is clear and responsive, but still treats it as beneficial to the witness’s side\. PragReST instead identifies the strategic implication of the same answer: by confirming the questioner’s premise, the witness gives the opposing side the concession it is seeking\. Similar patterns appear when a witness gives a precise damaging answer \(“Seven” abrasions\), confirms a document detail \(“Yes, it is”\), concedes a contamination pathway \(“it is likely”\), or admits a lack of licensed qualification\. In each case, PragReST treats the utterance not merely as a cooperative answer, but as a commitment whose strategic value depends on the question under discussion\.

This behavior suggests that the model interprets each utterance in relation to the adversarial context: it considers what the answer allows the questioner to infer and whether that inference advances the opposing side’s case\. In the recovered cases, PragReST often reasons over alternatives implicitly, recognizing that a direct answer rather than a hedge, a concession rather than a denial, or a clarification that still preserves a damaging inference can change which side the utterance benefits\. This supports the quantitative pattern in Table [10](https://arxiv.org/html/2606.18624#A3.T10): the most consistent gains appear in PaT, where success depends on recognizing when a response creates a strategic cost for the speaker\.

## Appendix D Details of the Counterfactual Error Analysis

This appendix describes the construction, validation, and use of the error taxonomy and counterfactual\-reasoning scores used in [Section 5\.1](https://arxiv.org/html/2606.18624#S5.SS1)\.

### D\.1 Error Taxonomy

#### Error\-case Collection\.

We collect incorrect PragMega predictions from each evaluated model\. For every incorrect example, we retain the original prompt, answer options, gold answer, model prediction, and model reasoning trace\. These traces are used only for diagnostic analysis, not for computing task accuracy\.

#### Inducing the Error Taxonomy\.

We induce the error taxonomy in a bottom\-up manner\. Instead of manually specifying categories before inspecting the data, we prompt a language\-model annotator to read batches of incorrect examples and propose recurring failure modes\. The annotator is instructed to focus on the underlying pragmatic reasoning failure rather than superficial lexical differences\. After inspecting the proposed categories, we merge near\-duplicates and remove categories that are too broad, too rare, or outside the scope of pragmatic reasoning\. We also remove categories that merely indicate that PragReST introduces our target behavior, since the goal is to characterize model errors rather than reward\-specific style differences\.

The final taxonomy contains the following non\-exclusive tags\.

- Literal / surface bias: the model anchors on literal wording or shallow semantic compatibility when the context requires a non\-literal pragmatic interpretation\.
- Missed communicative intent: the model fails to recover the speaker or listener’s pragmatic goal, such as politeness, avoidance, deception, indirect request, complaint, or social positioning\.
- Unsupported or overextended inference: the model over\-reasons from weak cues, invents assumptions not licensed by the prompt, or post\-hoc rationalizes an incorrect answer\.
- Coherence bridge error: the model misjudges whether an implicit causal, temporal, or discourse bridge between events is warranted\.
- Figurative or humor mapping error: the model fails to map figurative language, jokes, punchlines, or humorous incongruity to the intended interpretation\.

Tags are multi\-label: a single error may receive more than one tag if multiple failure modes are present\.

### D\.2 Annotation Protocol and Human Validation

#### GPT Annotation Protocol\.

For full\-scale annotation, we use a GPT\-4\.1\-mini annotator\. The annotator receives the prompt, gold answer, model prediction, and reasoning trace, along with the final taxonomy and short definitions of each error type\. It is instructed to assign all applicable labels and to avoid assigning a pragmatic label when the failure is better explained by a concrete context or option misread\. The model returns a structured label set for each incorrect example\.

#### Human Validation\.

To check whether GPT labels are reliable enough for diagnostic analysis, we run a blind annotation study\. Three human annotators independently label the same subset of examples using the same taxonomy, without seeing the model identity\. Human A is an external annotator who was not involved in the project\. We compute pairwise agreement among humans and between each human and GPT\.

Table 11: Agreement study for the induced error taxonomy\. All annotators labeled the same shared subset of errors under a blind setting without access to model identities\. Human A is an external annotator who was not involved in the project\.

The GPT annotator agrees with humans at approximately the same level as humans agree with one another\. We therefore use GPT labels for the full analysis, but treat the resulting labels as a scalable diagnostic rather than as definitive ground truth\.

Table 12: Agreement between human and LLM annotators for the induced pragmatic error taxonomy\. Each LLM row reports the average agreement between that LLM annotator and the three human annotators\. Agreement is computed over binary decisions for the five overlapping error categories\.
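Agreement over binary per\-category decisions can be sketched as below\. This is an illustration only: the tag identifiers are hypothetical short names for the five taxonomy categories, and raw \(not chance\-corrected\) agreement is an assumption, since the exact agreement statistic is not stated here\.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical identifiers for the five taxonomy categories.
TAGS = (
    "literal_surface_bias",
    "missed_communicative_intent",
    "unsupported_inference",
    "coherence_bridge_error",
    "figurative_humor_error",
)


def binary_agreement(a: Sequence[Set[str]], b: Sequence[Set[str]]) -> float:
    """Fraction of (example, tag) decisions on which two annotators agree.

    Each annotator provides one set of assigned tags per example, so every
    example contributes one present/absent decision per tag.
    """
    if len(a) != len(b) or not a:
        raise ValueError("annotators must label the same non-empty example set")
    decisions = [
        (tag in tags_a) == (tag in tags_b)
        for tags_a, tags_b in zip(a, b)
        for tag in TAGS
    ]
    return sum(decisions) / len(decisions)


def pairwise_agreement(
    annotations: Dict[str, List[Set[str]]]
) -> Dict[Tuple[str, str], float]:
    """Agreement for every unordered pair of annotators."""
    return {
        (x, y): binary_agreement(annotations[x], annotations[y])
        for x, y in combinations(sorted(annotations), 2)
    }
```

Because the tags are multi\-label, agreement is counted per tag rather than per example, so two annotators who share one of two assigned tags still receive partial credit\.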

### D\.3 Counterfactual Reasoning Score

For each reasoning trace, we compute a counterfactual\-reasoning score using a GPT\-based evaluator\. The evaluator is instructed to judge only the reasoning trace, not whether the final answer is correct\. It assigns five binary indicators corresponding to explicit counterfactual reasoning, alternative utterance or action, mismatch or contrast, speaker intent or pragmatic goal, and literal\-versus\-pragmatic contrast\. The CF score is the sum of these indicators and ranges from 0 to 5\. We use this score only for diagnostic analysis, not for training, filtering, checkpoint selection, or model evaluation\.
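A minimal sketch of how the CF score could be derived from the evaluator’s JSON output is shown below\. The field names and the `cf_score` helper are hypothetical: the rubric defines five binary indicators, but the exact JSON keys returned by the evaluator are not given\.

```python
import json

# Hypothetical JSON keys, one per rubric indicator.
CF_FIELDS = (
    "explicit_counterfactual",
    "alternative_utterance_or_action",
    "mismatch_or_contrast",
    "speaker_intent_or_goal",
    "literal_vs_pragmatic_contrast",
)


def cf_score(raw_json: str) -> int:
    """Sum the five binary indicators from the evaluator's JSON reply (0 to 5)."""
    data = json.loads(raw_json)
    total = 0
    for field in CF_FIELDS:
        value = data.get(field)
        # Reject missing fields, booleans, and anything other than 0 or 1.
        if isinstance(value, bool) or value not in (0, 1):
            raise ValueError(f"field {field!r} must be 0 or 1, got {value!r}")
        total += int(value)
    return total


# Toy evaluator reply for illustration only.
reply = json.dumps(
    {
        "explicit_counterfactual": 1,
        "alternative_utterance_or_action": 0,
        "mismatch_or_contrast": 1,
        "speaker_intent_or_goal": 1,
        "literal_vs_pragmatic_contrast": 0,
    }
)
print(cf_score(reply))  # 3
```

Recomputing the sum from the individual indicators, rather than trusting a total reported by the evaluator, guards against replies whose stated total disagrees with the per\-field values\.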

![Refer to caption](https://arxiv.org/html/2606.18624v1/x6.png)

Figure 6: Full diagnostic breakdown of error reduction and counterfactual\-reasoning scores across failure modes\. Rows correspond to induced error categories, and columns report the error\-rate change and counterfactual\-reasoning score statistics used in the main analysis\.

Counterfactual Reasoning Score Prompt

System\. You are evaluating whether a model’s reasoning trace uses counterfactual pragmatic reasoning\. Return only valid JSON\. Do not judge whether the final answer is correct; judge only the reasoning trace\.

User\. Score the reasoning trace using the rubric below\. Assign each field 0 or 1\.

1\. Explicit counterfactual\. Assign 1 if the trace explicitly reasons about a counterfactual pragmatic or communicative alternative, e\.g\., “if the speaker meant X, they would/could have said Y,” “had she intended X,” or “if this were literal, we would expect …”\. Do not count generic uncertainty, ordinary causal hypotheses, or narrative\-coherence alternatives unless the trace compares actual behavior, wording, option choice, or interpretation against an alternative communicative/pragmatic possibility\.

2\. Alternative utterance or action\. Assign 1 if the trace identifies an alternative wording, action, answer, or response that would be expected under a different intended meaning\.

3\. Mismatch or contrast\. Assign 1 if the trace explicitly notes a mismatch or contrast between the literal/surface reading and contextual/pragmatic cues, or between an option and what the speaker or situation would normally imply\.

4\. Speaker intent or pragmatic goal\. Assign 1 if the trace reasons about speaker/listener intent, social goal, politeness, deception, irony, indirectness, communicative purpose, or pragmatic meaning\.

5\. Literal vs\. pragmatic contrast\. Assign 1 if the trace explicitly contrasts literal meaning or face\-value reading with intended, pragmatic, figurative, indirect, ironic, or non\-literal meaning\.

Compute cf\_score as the sum of the five binary fields, from 0 to 5\.

Question prompt: \{FULL\_PROMPT\}

Model reasoning trace: \{REASONING\}

## Appendix E Artifact Licenses

Table 13: All datasets and models were used in accordance with their intended use\.

## Appendix F Note on AI Usage

We used AI tools for grammar correction and code completion\.
