How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models
Summary
This paper investigates asymmetries in LLMs' pragmatic competence by comparing their performance as judges of linguistic appropriateness versus as generators of pragmatically appropriate language. The study finds that many models perform substantially better as pragmatic listeners than as speakers, suggesting misalignment between evaluation and generation capabilities.
# How Hypocritical Is Your LLM judge? Listener–Speaker Asymmetries in the Pragmatic Competence of Large Language Models
Source: https://arxiv.org/html/2604.15873
###### Abstract
Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs’ performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.
Judith Sieker and Sina Zarrieß, Computational Linguistics, Department of Linguistics, Bielefeld University, Germany, {j.sieker;sina.zarriess}@uni-bielefeld.de
## 1 Introduction
Pragmatic competence is not one single ability. In everyday communication, people may recognize utterances as pragmatically odd, underspecified, or misleading, even though they do not always manage to produce fully appropriate responses themselves. For example, consider the question *“How old is the current king of France?”* As listeners, humans can readily judge that an answer such as *“There is no king of France”* challenges the false presupposition in the question. As speakers, however, producing such a response is more demanding: one must first detect the false presupposition, then decide not to answer the question on its own terms but to challenge its underlying assumption, and finally formulate an appropriate corrective reply. Judging pragmatic adequacy in an observed question–answer pair thus places different demands than producing a pragmatically appropriate response from scratch. Psycholinguistic work captures this asymmetry by treating language comprehension and production as related but non-identical tasks: they rely on overlapping knowledge, yet often differ in processing demands and error profiles Flynn (1986 (https://arxiv.org/html/2604.15873#bib.bib12)); Meyer et al. (2016 (https://arxiv.org/html/2604.15873#bib.bib24)); Ferreira and Ferreira (2024 (https://arxiv.org/html/2604.15873#bib.bib11)).
This distinction, however, has received little systematic attention in the evaluation of large language models (LLMs), where linguistic knowledge, not only in the domain of pragmatics, has been investigated from multiple angles Chang and Bergen (2023 (https://arxiv.org/html/2604.15873#bib.bib10)); Ma et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib23)), including generation-based tasks that probe models’ ability to generate contextually appropriate responses Sieker et al. (2023 (https://arxiv.org/html/2604.15873#bib.bib38)); Jian and Siddharth (2024 (https://arxiv.org/html/2604.15873#bib.bib18)); Wu et al. (2024 (https://arxiv.org/html/2604.15873#bib.bib48)), as well as judgment-style tasks in which models classify, interpret or evaluate linguistic outputs Sileo et al. (2022 (https://arxiv.org/html/2604.15873#bib.bib41)); Park et al. (2024 (https://arxiv.org/html/2604.15873#bib.bib31)); Hu et al. (2023 (https://arxiv.org/html/2604.15873#bib.bib16)). On top of that, LLM-as-a-judge formats are becoming increasingly popular, where models are used instead of human annotators to assess language quality or correctness Li et al. (2024 (https://arxiv.org/html/2604.15873#bib.bib22)); Bavaresco et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib7)).
What is typically left implicit, however, is whether these two evaluation perspectives – generation, which we refer to as *speaking*, and judgment, which we refer to as *listening* – reflect the same aspects of model performance. In practice, results from either type of setup are often discussed as evidence for or against a model’s competence, without testing whether performance transfers across roles. Especially for pragmatic reasoning tasks, however, this assumption is not obvious: differences between generating an appropriate answer and judging an (in)appropriate one may lead evaluation setups to probe distinct capacities and error profiles.
In this paper, we address this gap by contrasting LLMs’ behavior as *pragmatic listeners* and *pragmatic speakers*. We ask whether models that succeed at judging pragmatic adequacy also succeed in generating pragmatically appropriate language, or whether these capacities dissociate. We focus on three pragmatic tasks – False Presuppositions, Antipresuppositions, and Deductive Reasoning – which have been used in LLM evaluations, but have typically been assessed from only one of the two roles (Lachenmaier et al., 2025 (https://arxiv.org/html/2604.15873#bib.bib21); Sieker and Zarrieß, 2023 (https://arxiv.org/html/2604.15873#bib.bib40); Mondorf and Plank, 2024 (https://arxiv.org/html/2604.15873#bib.bib26)). For each task, we construct parallel *speaker* and *listener* setups over the same underlying items, enabling direct, item-level comparisons.
Our results reveal a consistent asymmetry between pragmatic listening and speaking in current LLMs. Across tasks, many models achieve substantially higher accuracy when judging pragmatic appropriateness than when tasked with generating pragmatically appropriate outputs themselves. Item-level analyses further show that correct judgments do not reliably predict successful generation. Our findings suggest that pragmatic evaluation and generation constitute partially distinct capabilities in current models, and that performance in listener-style evaluation tasks should not be taken as a proxy for pragmatic competence in generation.
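To make the idea of an item-level comparison concrete, the following is a minimal sketch of how a listener–speaker asymmetry could be summarized once each item has been scored in both roles. The record type `ItemResult`, the function `asymmetry_summary`, and the specific summary statistics are illustrative assumptions; they are not taken from the paper and do not reproduce its analysis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ItemResult:
    """Hypothetical per-item record: one model, one item, both roles."""
    item_id: str
    listener_correct: bool  # model judged the item correctly
    speaker_correct: bool   # model generated an appropriate output


def asymmetry_summary(results: list[ItemResult]) -> dict[str, Optional[float]]:
    """Summarize listener vs. speaker performance over the same items."""
    if not results:
        raise ValueError("need at least one item")
    n = len(results)
    listener_acc = sum(r.listener_correct for r in results) / n
    speaker_acc = sum(r.speaker_correct for r in results) / n
    # Items the model judged correctly; used to check whether a correct
    # judgment goes along with a successful generation on the same item.
    judged_ok = [r for r in results if r.listener_correct]
    speak_given_listen = (
        sum(r.speaker_correct for r in judged_ok) / len(judged_ok)
        if judged_ok
        else None
    )
    return {
        "listener_accuracy": listener_acc,
        "speaker_accuracy": speaker_acc,
        "gap": listener_acc - speaker_acc,
        "speaker_given_listener_correct": speak_given_listen,
    }


# Toy data, invented purely for illustration.
toy = [
    ItemResult("i1", True, True),
    ItemResult("i2", True, False),
    ItemResult("i3", True, False),
    ItemResult("i4", False, False),
]
summary = asymmetry_summary(toy)
```

A positive `gap` corresponds to a listener advantage, and a low `speaker_given_listener_correct` indicates that correct judgments do not carry over to generation on the same items.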
## 2 Related Work
#### Production and Comprehension in Psycholinguistics.
In psycholinguistics, language production and comprehension are generally treated as related but non-identical abilities. Although they draw on shared linguistic knowledge, they differ in task demands and processing constraints (Meyer et al., 2016 (https://arxiv.org/html/2604.15873#bib.bib24); Ferreira and Ferreira, 2024 (https://arxiv.org/html/2604.15873#bib.bib11)). Empirical work tends to point to a comprehension advantage. For instance, in a large-scale cross-linguistic study of more than 100,000 children, Bornstein and Hendricks (2012 (https://arxiv.org/html/2604.15873#bib.bib8)) found that comprehension typically precedes and exceeds production: listeners often understand linguistic forms that they cannot yet produce as speakers. Other experimental work further shows that comprehension and production tasks can probe different aspects of linguistic competence. Comprehension can succeed via contextual or heuristic strategies, whereas production requires the explicit selection and construction of linguistic structure under planning and memory constraints, making it more demanding (Flynn, 1986 (https://arxiv.org/html/2604.15873#bib.bib12); Ferreira and Ferreira, 2024 (https://arxiv.org/html/2604.15873#bib.bib11)). As a result, success in comprehension tasks does not guarantee corresponding success in production.
#### Generating and Judging in LLMs.
When it comes to evaluating pragmatic (and more generally, linguistic) competence in LLMs, existing work often does not clearly distinguish between comprehension- and production-based abilities. Instead, much of the existing literature implicitly assigns models one of two roles, which we operationalise as *listening* – evaluating the pragmatic adequacy of a given utterance – and *speaking* – generating a pragmatically appropriate utterance.
Existing work has predominantly evaluated models in the listener role, targeting language comprehension abilities by asking models to classify, rate, or evaluate linguistic outputs. For example, Sileo et al. (2022 (https://arxiv.org/html/2604.15873#bib.bib41)) aggregate benchmarks on different pragmatic phenomena (e.g., discourse relations, speech acts or implicatures) to assess how well NLU models capture pragmatic meaning beyond literal semantics. For this, they ask models to interpret, classify, or judge given utterances. Hu et al. (2023 (https://arxiv.org/html/2604.15873#bib.bib16)) compare humans and language models on different pragmatic phenomena by using multiple-choice materials, asking models to interpret a speaker’s utterance and select the intended meaning or rationale from multiple choices. Park et al. (2024 (https://arxiv.org/html/2604.15873#bib.bib31)) propose a multilingual benchmark for evaluating pragmatic comprehension in LLMs that is grounded in Grice’s Cooperative Principle. Here, models are placed in the role of an interpreter of a given utterance, tasked to infer intended meanings by choosing among candidate interpretations. Similarly, Askari et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib5)) evaluate BabyLMs’ adherence to Gricean maxims by testing whether models assign higher probability to maxim-adhering than to maxim-violating candidate answers.
In contrast to such listener tasks, *speaker*-oriented evaluations target language production abilities as a probe of linguistic competence, using free or constrained generation. For example, Sieker et al. (2023 (https://arxiv.org/html/2604.15873#bib.bib38)) study whether Implicit Causality prompts can be used to evaluate discourse-level text generation in LLMs. The models’ task is to generate sentence continuations (e.g., "Tom admired Sarah because ..."), and human annotators judge the quality of the generated text. Jian and Siddharth (2024 (https://arxiv.org/html/2604.15873#bib.bib18)) investigate whether LLMs behave like pragmatic speakers by evaluating utterance production preferences in reference games, measuring how likely models are to generate particular referring expressions given a target object and context. Ali et al. (2026 (https://arxiv.org/html/2604.15873#bib.bib2)) also use reference games, but focus on whether models translate uncertainty into pragmatically appropriate clarification requests. Wu et al. (2024 (https://arxiv.org/html/2604.15873#bib.bib48)) assess models’ pragmatic competence based on generated responses to social–pragmatic scenarios, using reference-based and preference-based evaluation of free-form outputs.
Crucially, these two evaluation paradigms are rarely examined in direct relation to one another. Models are usually assessed either in listener-style or speaker-style settings, and results from one paradigm are often interpreted in terms of general linguistic competence, without testing whether performance transfers across roles. One notable exception is Qiu et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib35)), who evaluate both comprehension and production within a communicative game setting. However, production performance is assessed indirectly via listener success (i.e., speaker outputs are evaluated insofar as they enable correct interpretation by a listener), and the study does not examine how judging and generation relate on the same items across different pragmatic phenomena.
In parallel, the use of LLMs as automatic judges is becoming increasingly common, both as components of evaluation pipelines and as substitutes for human annotation Li et al. (2024 (https://arxiv.org/html/2604.15873#bib.bib22)); Calderon et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib9)). In these setups, models are explicitly placed in a listener-style role, where they assess or rate linguistic outputs produced by others. Although LLMs have been used as judges in pragmatic reasoning tasks Yu et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib49)), to our knowledge, no prior work systematically evaluates their adequacy in this linguistic domain. Existing studies instead emphasize alignment with human ratings Bavaresco et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib7)); Thakur et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib46)) or examine judge behavior in other domains, such as mathematical reasoning Stephan et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib45)).
Furthermore, while LLM-as-a-judge approaches do not generally claim that judgment performance determines generative ability, evaluative behavior is often seen as an indicator of model competence – yet whether success in listener-style judgment aligns with success in speaker-style generation remains largely untested. A notable related study is Piot et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib33)), who compare model behavior as judges and as generators in non-pragmatic domains such as content moderation and safety. While their results reveal systematic differences between evaluative and generative behavior, the study does not consider pragmatic tasks or examine judging and generation on the same underlying items. As a result, it remains open whether similar asymmetries arise for pragmatic evaluation and generation. In the following, we address this question empirically.
## 3 Experiment
| Task | Pragmatic Speaker Prompt | Pragmatic Listener Prompt |
|---|---|---|
| False Presuppositions, Lachenmaier et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib21)); Sieker et al. (2025 (https://arxiv.org/html/2604.15873#bib.bib39)) | Did voters find out that the AfD party is in favor of Ukraine becoming a member of the EU? | Assess whether the false presupposition from the prompt was accommodated (accepted) in the response or not.<br>A = false presupposition was accommodated (the false presupposition was clearly accepted/confirmed in the response)<br>N = false presupposition was not accommodated (the presupposition was clearly rejected/refuted in the response)<br>U = unclear (anything else; i.e., responses where we cannot say for sure whether the false presupposition was accepted or not)<br>Prompt: "Did voters find out that the AfD party is in favor of Ukraine becoming a member of the EU?"<br>False presupposition: "The AfD party is in favor of Ukraine becoming a member of the EU."<br>Answer: "74 percent of those surveyed wanted Ukraine to become a member of the EU."<br>Answer with a letter from {A, N, U}. Do not explain your answer. |
| Antipresuppositions, Schneider et al. (2019 (https://arxiv.org/html/2604.15873#bib.bib37)); Sieker and Zarrieß (2023 (https://arxiv.org/html/2604.15873#bib.bib40)) | Fill in the missing word marked with XXXX in the following sentence.<br>Sentence: "Jan’s mother went shopping. She bought a pear and two bananas. Of these, Jan received XXXX banana."<br>Choose one of the following words: ’the’ or ’a’.<br>Answer with one word only. Do not explain your answer. | Assess whether the continuation is appropriate given the context sentence.<br>A = The continuation is appropriate.<br>N = The continuation is not appropriate.<br>Context sentence: "Jan’s mother went shopping. She bought a pear and two bananas."<br>Continuation: "Of these, Jan received the banana."<br>Answer with a letter from {A, N}. Do not explain your answer. |
| Deductive Reasoning, Van der Henst et al. (2002 (https://arxiv.org/html/2604.15873#bib.bib47)); Mondorf and Plank (2024 (https://arxiv.org/html/2604.15873#bib.bib26)) | Fill in the missing word marked with XXXX in the conclusion so that it logically follows from the set of statemen… | |
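As a rough illustration of how parallel speaker and listener prompts can be derived from one underlying item, the sketch below rebuilds the two Antipresuppositions prompts shown above from a single record. The prompt wording follows the example prompts; the `AntipresuppositionItem` type, its field names, and the two helper functions are hypothetical and are not the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AntipresuppositionItem:
    """Hypothetical container for one item shared by both roles."""
    context: str               # context sentence(s)
    continuation: str          # continuation with an XXXX gap
    options: tuple[str, ...]   # candidate fillers for the gap


def speaker_prompt(item: AntipresuppositionItem) -> str:
    """Speaker role: the model must choose the word that fills the gap."""
    choices = " or ".join(f"'{word}'" for word in item.options)
    return (
        "Fill in the missing word marked with XXXX in the following sentence.\n"
        f'Sentence: "{item.context} {item.continuation}"\n'
        f"Choose one of the following words: {choices}.\n"
        "Answer with one word only. Do not explain your answer."
    )


def listener_prompt(item: AntipresuppositionItem, word: str) -> str:
    """Listener role: the model judges a continuation with the gap filled."""
    if word not in item.options:
        raise ValueError(f"{word!r} is not a candidate for this item")
    filled = item.continuation.replace("XXXX", word)
    return (
        "Assess whether the continuation is appropriate given the context sentence.\n"
        "A = The continuation is appropriate.\n"
        "N = The continuation is not appropriate.\n"
        f'Context sentence: "{item.context}"\n'
        f'Continuation: "{filled}"\n'
        "Answer with a letter from {A, N}. Do not explain your answer."
    )


item = AntipresuppositionItem(
    context="Jan's mother went shopping. She bought a pear and two bananas.",
    continuation="Of these, Jan received XXXX banana.",
    options=("the", "a"),
)
speaker_text = speaker_prompt(item)
listener_text = listener_prompt(item, "the")
```

Because both prompts are generated from the same `item`, a model's generated word and its judgment of a filled-in continuation can be compared item by item.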