Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit


Summary

Researchers from University of Technology Sydney compare fine-tuned transformers (DistilBERT, RoBERTa) against zero-shot LLMs (Llama variants, Claude, Gemini) for classifying misinformation responses on Reddit, finding that fine-tuned RoBERTa achieves 0.62 macro-F1 versus 0.50 for the best zero-shot model. The study shows that task-specific fine-tuning outperforms larger generalist models, particularly for detecting belief propagation, and that safety-alignment artifacts in frontier models can degrade performance.

arXiv:2606.04274v1 (announce type: new)

# Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit
Source: [https://arxiv.org/html/2606.04274](https://arxiv.org/html/2606.04274)
Lin Tian, Angela Brillantes, Adriana\-Simona Mihăiță, Marian\-Andrei Rizoiu

University of Technology Sydney, Sydney, Australia

###### Abstract

As large language models \(LLMs\) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse\. We test this assumption directly on 900 Reddit comments spanning three PolitiFact\-verified misinformation claims \(environment, health, immigration\), labelled as *belief* \(propagates the claim\), *fact\-check* \(corrects it\), or *other*\. We compare nine models across three paradigms: BART\-MNLI, three Llama variants, three commercial frontier LLMs \(Claude Haiku 4\.5, Gemini Flash Lite 2\.5, Claude Sonnet 4\.6\), and fine\-tuned DistilBERT and RoBERTa, under universal and topic\-specific label schemas\.

The assumption does not hold\. Fine\-tuned RoBERTa reaches 0\.62 macro\-$F_1$ against a best zero\-shot result of 0\.50 \(Claude Haiku 4\.5\), at a fraction of the per\-query cost; the supervised advantage is concentrated on the *belief* class, the implicit, affective category every zero\-shot model under\-detects\. Scaling does not help: Llama\-3\-8B matches Llama\-3\-70B, and Claude Sonnet 4\.6 underperforms the smaller Haiku under generic labels, collapsing belief detection to 0\.17 and refusing outright on a subset of comments flagged as sensitive\. This is a safety\-alignment artefact, not a capacity limit\. Label schema and topic jointly shape zero\-shot performance, with the same model varying by more than 0\.13 macro\-$F_1$ across topics under matched labels\. In a verification context, where missing belief is the costlier error, task\-specific fine\-tuning remains the more reliable choice despite the proliferation of large generative models\.

###### Keywords

Misinformation detection, misinformation response classification, social media analysis, transformer models, zero\-shot learning, fine\-tuning

## 1 Introduction

Large language models \(LLMs\) are now embedded in search engines, virtual assistants, and conversational interfaces such as ChatGPT\[[1](https://arxiv.org/html/2606.04274#bib.bib1)\], Gemini\[[34](https://arxiv.org/html/2606.04274#bib.bib34)\], and Claude \(footnote 1: [https://www\-cdn\.anthropic\.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model\_Card\_Claude\_3\.pdf](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf)\)\. When users encounter a claim that might be false, they increasingly turn to these systems for verification rather than to dedicated fact\-checking tools\. This deployment pattern carries an implicit assumption: that scale and general capability are sufficient for the fine\-grained classification of misinformation discourse\. We test that assumption directly\. Can today’s zero\-shot LLMs, including recent commercial frontier models, classify how people respond to misinformation as reliably as small, task\-specific fine\-tuned transformers?

Classifying responses to misinformation is non\-trivial\. A comment on a social media post may *propagate* the false claim \(*belief*\), *challenge* it \(*fact\-check*\), or be neutral or unrelated \(*other*\)\. Distinguishing these is central to understanding how misinformation spreads and how communities contest it\[[32](https://arxiv.org/html/2606.04274#bib.bib32),[37](https://arxiv.org/html/2606.04274#bib.bib37),[8](https://arxiv.org/html/2606.04274#bib.bib8)\]\. Unlike sentiment or topic classification, the task requires inferring the commenter’s *epistemic* orientation relative to a specific claim, a judgement sensitive to label design, model architecture, and the discourse of each topic\. The three classes are also linguistically asymmetric: fact\-check responses tend to carry explicit corrective markers, such as cited evidence or direct contradictions, whereas belief is often implicit, affective, or sarcastic\. This asymmetry recurs across our results and makes the *belief* class the most difficult to detect and, in a verification setting, the costliest to miss\.

Two paradigms dominate the literature on misinformation response and stance classification\[[2](https://arxiv.org/html/2606.04274#bib.bib2),[25](https://arxiv.org/html/2606.04274#bib.bib25),[14](https://arxiv.org/html/2606.04274#bib.bib14)\]: zero\-shot inference\[[41](https://arxiv.org/html/2606.04274#bib.bib41),[31](https://arxiv.org/html/2606.04274#bib.bib31)\] and supervised fine\-tuning\[[10](https://arxiv.org/html/2606.04274#bib.bib10),[23](https://arxiv.org/html/2606.04274#bib.bib23)\]\. Two zero\-shot architectural families approach the task differently\. Natural language inference \(NLI\) models such as BART\-MNLI\[[22](https://arxiv.org/html/2606.04274#bib.bib22)\] evaluate each label as a hypothesis against the input text, while generative LLMs\[[6](https://arxiv.org/html/2606.04274#bib.bib6),[36](https://arxiv.org/html/2606.04274#bib.bib36)\] produce class predictions through prompt\-based completion\. Recent commercial systems combine instruction tuning with safety alignment, and how these features interact with classification under ambiguous label wording remains poorly understood\. Prior empirical comparisons either predate the current commercial frontier or examine one paradigm in isolation\.

We compare nine models on 900 Reddit comments spanning three PolitiFact\-verified misinformation claims, one each in environment, health, and immigration\. The zero\-shot setting includes BART\-MNLI, three Llama variants \(Llama\-3\.2\-3B, Llama\-3\-8B, Llama\-3\-70B\), and three commercial frontier models \(Claude Haiku 4\.5, Gemini Flash Lite 2\.5, Claude Sonnet 4\.6\)\. The supervised setting includes DistilBERT and RoBERTa\. We evaluate under two label schema conditions, universal labels shared across all topics and topic\-specific labels tailored to each claim, using stratified 5\-fold cross\-validation; permutation tests across all 900 predictions provide statistical significance for the main comparisons\.
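The evaluation quantities named above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it computes macro\-$F_1$ over the three classes and a paired permutation test that randomly swaps the two models' predictions per comment. The swap scheme, the two\-sided statistic, the function names, and the default number of permutations are all assumptions, since the exact test procedure is not described in this section.

```python
import random
from typing import Sequence

LABELS = ("belief", "fact-check", "other")


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1 scores."""
    per_class = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class that is never gold and never predicted contributes 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def paired_permutation_p(gold: Sequence[str], pred_a: Sequence[str],
                         pred_b: Sequence[str], n_perm: int = 2000,
                         seed: int = 0) -> float:
    """Two-sided p-value for the macro-F1 difference between two models.

    Assumed scheme: for each comment, swap the two models' predictions
    with probability 0.5 and recompute the absolute difference.
    """
    rng = random.Random(seed)
    observed = abs(macro_f1(gold, pred_a) - macro_f1(gold, pred_b))
    hits = 0
    for _ in range(n_perm):
        swapped_a, swapped_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        diff = abs(macro_f1(gold, swapped_a) - macro_f1(gold, swapped_b))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Because macro\-$F_1$ averages the per\-class scores without weighting, a model that ignores one class is penalised by a full third of the score, which is why under\-detection of *belief* weighs so heavily in the comparisons that follow\.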

We organise our analysis around four empirical findings\. First, label schema and topic jointly shape zero\-shot performance: topic\-specific labels can substantially improve macro\-$F_1$ \(e\.g\., from 0\.31 to 0\.54, $\Delta = 0.23$, $p_{\text{Holm}} < 0.01$ in the environment topic\), but the gain does not generalise uniformly across topics, and the same model can vary by more than 0\.13 macro\-$F_1$ across topics under matched labels\. Second, scaling does not consistently help: Llama\-3\-8B matches or surpasses Llama\-3\-70B in several settings, and the more capable Claude Sonnet 4\.6 underperforms the smaller Claude Haiku 4\.5 under generic labels \(0\.42 vs\. 0\.50\)\. This inversion stems from safety alignment rather than limited capacity: Sonnet collapses belief detection to 0\.17 macro\-$F_1$ and refuses outright on a subset of comments flagged as sensitive\. Third, supervised fine\-tuning outperforms every zero\-shot configuration tested, and its advantage is concentrated on the *belief* class: RoBERTa reaches $F_1 = 0.52$ versus at most 0\.34 for any zero\-shot model with balanced overall performance\. Fourth, BART\-MNLI remains a strong zero\-shot baseline: it combines balanced predictions with low inference cost and is the most efficient option when supervised data are unavailable\.

These results point to a recurring failure mode in current LLM\-based verification: zero\-shot models, including the proprietary frontier, under\-detect belief\-propagating content, the class whose detection matters most for verification\. Fine\-tuned RoBERTa achieves 0\.62 macro\-$F_1$, well above the best zero\-shot result \(0\.50\), at a fraction of the inference and per\-query cost\. Task\-specific fine\-tuning remains a competitive strategy despite the rapid growth of large generative models\.

**Contributions\.** This paper makes the following contributions:

- **A head\-to\-head comparison of zero\-shot LLMs, including recent commercial frontier models, against fine\-tuned transformers** for misinformation response classification\. Across three topics, two label schemas, and 5\-fold cross\-validation with permutation\-based significance testing, we show that fine\-tuned RoBERTa \(0\.62 macro\-$F_1$\) outperforms every zero\-shot configuration tested \(best: Claude Haiku 4\.5 at 0\.50\), at a fraction of the per\-query cost\. This challenges the common assumption that scale and generality are sufficient for fine\-grained verification tasks\.
- **Evidence that the supervised advantage is concentrated on the belief class, the class that matters most for verification\.** Every zero\-shot model we test under\-detects belief, and commercial safety alignment widens the gap: Claude Sonnet 4\.6’s belief $F_1$ collapses to 0\.17 under generic labels, accompanied by outright refusals on a subset of comments flagged as sensitive\. Fine\-tuning closes this gap by $\Delta = +0.18$ over the best balanced zero\-shot model, so the supervised advantage extends to the higher\-stakes class\.
- **A curated, publicly released dataset of 900 Reddit comments** annotated for stance towards three PolitiFact\-verified misinformation claims \(one each in environment, health, and immigration\), with a balanced class distribution, to support future research on topic\-dependent model behaviour\. \(Footnote 2: We will release the codebook, the labelled dataset, and relevant scripts for data processing, model inference, and evaluation upon paper acceptance\.\)

## 2 Related Work

Research on misinformation detection intersects several areas of natural language processing and computational social science, including response classification toward misinformation claims, rumour verification, transformer\-based text classification, and social media discourse analysis\. This section summarises prior work most directly related to classifying user responses to misinformation in online discussions\.

### 2\.1 Classifying Responses to Misinformation

Classifying how users respond to misinformation claims is closely related to the established field of stance detection, though distinct in scope and taxonomy\. The stance detection task was formally introduced in the SemEval\-2016 Task 6 shared task\[[25](https://arxiv.org/html/2606.04274#bib.bib25)\], which defined the canonical classification framework of *Favour*, *Against*, and *Neither* toward a specified target\. Our task differs: rather than classifying attitude toward a target entity, we classify the *type of response* to a misinformation claim \(belief, fact\-check, or other\)\. Nevertheless, stance detection provides important methodological foundations\. The competition attracted numerous systems and highlighted the difficulty of generalising such classification across targets\. Subsequent analysis\[[26](https://arxiv.org/html/2606.04274#bib.bib26)\] demonstrated that stance and sentiment are related but distinct phenomena, a distinction that is particularly relevant in misinformation discourse where sarcastic or ironic statements may express positive sentiment while implicitly rejecting a claim\.

Early neural approaches significantly advanced stance detection\. Augenstein et al\. \[[2](https://arxiv.org/html/2606.04274#bib.bib2)\] introduced conditional LSTM encoding that models the relationship between a target and the surrounding text, demonstrating that target\-dependent representations improve stance classification\. This work established the importance of explicitly modelling the relationship between the claim and the response text, a challenge that remains central in misinformation stance detection where the claim may be implicit rather than explicitly stated\.

The RumourEval shared tasks further extended stance detection research to misinformation contexts\. Derczynski et al\. \[[9](https://arxiv.org/html/2606.04274#bib.bib9)\] introduced the SDQC framework \(*Support*, *Deny*, *Query*, and *Comment*\) for classifying responses to rumours in Twitter conversation threads\. The RumourEval datasets contain thousands of annotated tweets across breaking news events and have become a standard benchmark for rumour stance classification\. Subsequent systems\[[21](https://arxiv.org/html/2606.04274#bib.bib21),[12](https://arxiv.org/html/2606.04274#bib.bib12)\] explored neural architectures for modelling conversation structures and jointly predicting rumour stance and veracity\.

Comprehensive surveys of misinformation stance detection highlight the growing importance of this task within automated fact\-checking pipelines\. Hardalov et al\. \[[14](https://arxiv.org/html/2606.04274#bib.bib14)\] review approaches spanning feature\-based methods, neural architectures, and transformer models, emphasising that stance detection provides an important signal for downstream misinformation verification\. Similarly, surveys of automated fact\-checking\[[13](https://arxiv.org/html/2606.04274#bib.bib13)\] describe stance classification as a key component of systems designed to identify and analyse misleading claims circulating online\.

### 2\.2 Transformer Models for Misinformation Response Classification

Recent advances in natural language processing have been driven largely by transformer architectures\. BERT\[[10](https://arxiv.org/html/2606.04274#bib.bib10)\] introduced the pre\-train–then–fine\-tune paradigm using masked language modelling and next\-sentence prediction objectives, enabling strong performance across a wide range of text classification tasks\. RoBERTa\[[23](https://arxiv.org/html/2606.04274#bib.bib23)\] subsequently demonstrated that BERT could be significantly improved through optimised training procedures, including dynamic masking and larger training corpora\.

DistilBERT\[[30](https://arxiv.org/html/2606.04274#bib.bib30)\] applies knowledge distillation to produce a smaller and faster model while retaining most of BERT’s capabilities\. The resulting model is approximately 40% smaller and 60% faster while preserving around 97% of BERT’s performance, making it particularly attractive for practical applications where computational efficiency is important\.

Another relevant development is the use of natural language inference models for zero\-shot text classification\. Yin et al\. \[[41](https://arxiv.org/html/2606.04274#bib.bib41)\] showed that classification tasks can be reframed as textual entailment problems by treating the input text as a premise and candidate labels as hypotheses\. Models trained on the Multi\-Genre Natural Language Inference \(MNLI\) dataset can therefore perform classification without task\-specific training data\. This approach underlies the widely used BART\-large\-MNLI model\[[22](https://arxiv.org/html/2606.04274#bib.bib22)\], which has become a standard baseline for zero\-shot classification tasks\.
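The entailment reframing described above can be sketched as follows. This is an illustrative outline only: the hypothesis template, the label phrasings, and the `toy_entail_score` function are hypothetical stand\-ins, and in practice the scorer would be the entailment probability produced by an MNLI\-trained model such as BART\-large\-MNLI\.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical hypothesis template; the wording used by the authors
# is not given in this section.
TEMPLATE = "This comment is {label}."


def build_hypotheses(labels: Sequence[str],
                     template: str = TEMPLATE) -> Dict[str, str]:
    """Turn each candidate label into a natural-language hypothesis."""
    return {label: template.format(label=label) for label in labels}


def zero_shot_classify(
    premise: str,
    labels: Sequence[str],
    entail_score: Callable[[str, str], float],
    template: str = TEMPLATE,
) -> Tuple[str, Dict[str, float]]:
    """Score every (premise, hypothesis) pair and pick the best label.

    `entail_score` is any function returning a non-negative entailment
    score; scores are normalised across labels when their sum is positive.
    """
    hypotheses = build_hypotheses(labels, template)
    raw = {label: entail_score(premise, hyp)
           for label, hyp in hypotheses.items()}
    total = sum(raw.values())
    scores = {k: v / total for k, v in raw.items()} if total > 0 else raw
    best = max(scores, key=scores.get)
    return best, scores


def toy_entail_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer based on word overlap; NOT an NLI model."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().rstrip(".").split())
    return len(premise_words & hypothesis_words) / len(hypothesis_words)
```

The design point is that the label wording becomes part of the model input: changing a label from a generic phrase to a topic\-specific one changes the hypothesis being scored, which is one reason label schema can move zero\-shot results\.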

Transformer models have been widely applied to misinformation and stance detection tasks\. For example, Karande et al\. \[[19](https://arxiv.org/html/2606.04274#bib.bib19)\] combine BERT\-based stance features with article content to evaluate credibility, while Kawintiranon and Singh \[[20](https://arxiv.org/html/2606.04274#bib.bib20)\] incorporate external knowledge into transformer\-based stance models\. Other studies use hierarchical transformer architectures to model conversation structures in rumour discussions\[[42](https://arxiv.org/html/2606.04274#bib.bib42)\]\. These works demonstrate that transformer\-based representations capture contextual information effectively, enabling improved performance in misinformation detection tasks\.

### 2\.3 Low\-Resource and Few\-Shot Text Classification

A recurring challenge in misinformation research is the limited availability of labelled datasets\. Manual annotation of stance or misinformation labels is time\-consuming and often requires domain expertise\. As a result, several studies explore approaches for learning effectively with limited labelled data\.

Large language models have demonstrated strong few\-shot learning capabilities\. Brown et al\. \[[6](https://arxiv.org/html/2606.04274#bib.bib6)\] show that scaling model size enables models to perform new tasks with only a small number of examples provided in the prompt\. Alternative approaches focus on improving performance in low\-resource settings through task reformulation\. For example, Pattern\-Exploiting Training \(PET\)\[[31](https://arxiv.org/html/2606.04274#bib.bib31)\] reformulates classification problems as cloze\-style language modelling tasks, enabling models to leverage pre\-training knowledge more effectively\.

Other studies investigate data augmentation techniques for small datasets\. Wei and Zou \[[38](https://arxiv.org/html/2606.04274#bib.bib38)\] demonstrate that simple augmentation strategies, such as synonym replacement, word insertion, and deletion, can significantly improve classification performance when training data is limited\. Surveys of low\-resource NLP methods\[[16](https://arxiv.org/html/2606.04274#bib.bib16)\] describe a range of approaches including transfer learning, distant supervision, and domain\-adaptive pre\-training\.

Despite these advances, there remains limited empirical evidence on how much labelled data is required for reliable misinformation response classification\. Quantifying the relative performance of zero\-shot and fine\-tuned models in low\-resource settings therefore remains an important research question\.

### 2\.4 Social Media Platforms and Reddit\-Based Research

Most misinformation detection research has focused on Twitter due to the availability of publicly accessible datasets\. However, other social media platforms exhibit different structural and conversational characteristics that may influence how misinformation spreads and is contested\.

Reddit provides a distinct environment for analysing misinformation discourse\. The platform is organised into topic\-specific communities \(subreddits\) and uses threaded discussions that capture reply relationships between comments\. These design features enable more extended debates compared with the short message format of Twitter\. Studies of online communities have shown that such structures can foster both echo chambers and active debate within discussion threads\[[8](https://arxiv.org/html/2606.04274#bib.bib8),[39](https://arxiv.org/html/2606.04274#bib.bib39)\]\.

Although most misinformation detection research has focused on Twitter datasets, Reddit provides a complementary environment for studying misinformation discourse due to its topic\-based communities and threaded discussion structure\.

Another linguistic challenge in online discussions is the frequent use of sarcasm and irony\. Surveys of sarcasm detection\[[18](https://arxiv.org/html/2606.04274#bib.bib18)\] highlight the difficulty of automatically identifying sarcastic language, which often relies on contextual knowledge and pragmatic cues\. Sarcasm detection studies such as CASCADE\[[15](https://arxiv.org/html/2606.04274#bib.bib15)\] demonstrate that sarcastic expressions frequently occur in online forums, posing a challenge for automated stance classification\.

### 2\.5 Research Gap

Despite substantial research on stance detection and misinformation analysis, three gaps remain\. First, most studies focus on Twitter datasets, leaving misinformation discourse on Reddit relatively understudied despite the platform’s active discussion communities and threaded conversation structure\. Second, few studies directly compare zero\-shot and fine\-tuned transformer models for classifying user responses to misinformation claims\. Third, with users increasingly relying on LLMs for information verification, the zero\-shot capability of these models to identify misinformation responses has direct practical significance that remains underexplored\. This study addresses these gaps by evaluating multiple transformer architectures on Reddit discussions across several misinformation topics\.

## 3 Dataset

### 3\.1 Data Sources, Misinformation Claims, and Collection

**Platform\.** Reddit was selected as the primary data source for three reasons\. First, its design encourages open, threaded debate with minimal editorial intervention, making it a naturalistic environment for studying misinformation discourse\. Second, the platform’s community structure \(subreddits with varying moderation norms and political orientations\) enables cross\-community comparison of how the same claim is received\. Third, Reddit data are accessible via a public API and have been studied for related tasks including rumour detection and opinion polarisation\[[43](https://arxiv.org/html/2606.04274#bib.bib43),[12](https://arxiv.org/html/2606.04274#bib.bib12)\], providing methodological precedent for our annotation scheme\. In this study, we rely exclusively on the textual content of posts and comments, without incorporating Reddit’s underlying social network structure\.

**Misinformation Source\.** The study starts from three misinformation claims verified by PolitiFact, each with a substantial presence on Reddit\. These claims were selected to represent diverse misinformation types \(a viral fabrication, a recurring political misquotation, and a visually manipulated image\) across three topics \(environment, health, and immigration\)\. PolitiFact\[[28](https://arxiv.org/html/2606.04274#bib.bib28)\] is a nonpartisan fact\-checking outlet managed by the Poynter Institute for Media Studies that independently investigates claims using on\-the\-record sources and publishes transparent methodology alongside each verdict; claims are rated on a six\-point “Truth\-O\-Meter” scale ranging from *True* to *Pants on Fire*\. We restricted selection to claims rated *False*, *Mostly False*, or *Pants on Fire*, ensuring that all misinformation in the dataset has been independently verified as inaccurate or misleading\.

**Misinformation Claims\.** The three PolitiFact\-verified claims were selected on the basis of \(1\) measurable presence and community engagement across multiple subreddits, \(2\) topic diversity to enable cross\-topic generalisation, and \(3\) verified falsity on the Truth\-O\-Meter\. [Table 1](https://arxiv.org/html/2606.04274#S3.T1) summarises each claim together with its raw corpus statistics; the full dataset comprises 65,173 comments across 62 threads from 54 unique subreddits\. The health and immigration topics exhibit substantially higher per\-thread engagement \(median 840 and 646 comments, respectively, versus 65 for environment\), likely reflecting their association with prominent political figures and their emotionally polarising nature\. Detailed topic context is provided in [Appendix A](https://arxiv.org/html/2606.04274#A1)\.

Table 1: Misinformation claims selected for the study, with raw corpus collection statistics \(prior to annotation sampling\)\. PolitiFact ratings: F = False, MF = Mostly False, PoF = Pants on Fire\. Mean and Median refer to comments per thread \(c\./t\.\)\.

| | Water Scandal | Bleach Cure | Gang Tattoos |
|---|---|---|---|
| Topic | Environment | Health | Immigration |
| Claim | One billionaire couple owns almost all the water in California\[[24](https://arxiv.org/html/2606.04274#bib.bib24)\] | Donald Trump told Americans to drink or inject bleach to cure COVID\-19\[[7](https://arxiv.org/html/2606.04274#bib.bib7),[33](https://arxiv.org/html/2606.04274#bib.bib33)\] | MS\-13 tattoos on Kilmar García’s knuckles prove gang membership\[[17](https://arxiv.org/html/2606.04274#bib.bib17)\] |
| Rating | False | Mostly False | Pants on Fire |
| Threads | 27 | 15 | 20 |
| Comments | 5,022 | 38,130 | 22,021 |
| Mean c\./t\. | 186 | 2,542 | 1,101 |
| Median c\./t\. | 65 | 840 | 646 |

**Data Collection\.** Data collection used the Python Reddit API Wrapper \(PRAW\)\[[5](https://arxiv.org/html/2606.04274#bib.bib5)\], which provides access to Reddit’s public API\. Relevant submissions were identified through keyword searches across public subreddits; search terms were derived from key phrases in each PolitiFact claim \(e\.g\., “California water billionaire,” “Trump bleach COVID,” “Garcia MS\-13 tattoo”\)\. For claims with a visual component, supplementary Google Image searches with the keyword “Reddit” were conducted to locate image\-based posts and their comment trees\. For each submission, the post title, body text, timestamp, score, subreddit name, and full comment hierarchy \(including parent–child relationships\) were retrieved\. We removed non\-English content and comments that were deleted or contained fewer than three words\.
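The cleaning rules in the last sentence above could look roughly like the sketch below. It covers only the deleted\-comment and minimum\-length filters; the language filter is omitted because the detection method is not stated\. The placeholder strings in `DELETED_MARKERS` and the function names are assumptions for illustration, not details taken from the paper\.

```python
from typing import Iterable, List

# Example search phrases quoted in the text, keyed by topic.
SEARCH_TERMS = {
    "environment": "California water billionaire",
    "health": "Trump bleach COVID",
    "immigration": "Garcia MS-13 tattoo",
}

# Assumed placeholder bodies for deleted or moderator-removed comments.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def keep_comment(body: str, min_words: int = 3) -> bool:
    """Return True if a comment body survives the basic cleaning rules.

    Drops empty bodies, deleted/removed placeholders, and comments with
    fewer than `min_words` whitespace-separated tokens.
    """
    text = body.strip()
    if not text or text in DELETED_MARKERS:
        return False
    return len(text.split()) >= min_words


def filter_comments(bodies: Iterable[str]) -> List[str]:
    """Apply keep_comment to every body, preserving input order."""
    return [body for body in bodies if keep_comment(body)]
```

Counting words by whitespace splitting is the simplest reading of "fewer than three words"; a tokeniser\-based count would treat punctuation and URLs differently\.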

### 3\.2Annotation

**Sampling Strategy\.** From the 65,173 collected comments, 900 were selected for manual annotation \(300 per topic, 100 per stance class within each topic\)\. To identify candidate comments from this large, class\-imbalanced pool, ChatGPT \(GPT\-4o, accessed interactively via the web interface, July 2025\) was used as an auxiliary pre\-filtering tool: given the three class definitions for each topic \(see below\), it surfaced comments plausibly matching each category as candidate selections\. These suggestions were then reviewed and confirmed or rejected by the human annotator, who made all final labelling decisions\. The forced\-balanced design \(exactly 100 per class per topic\) was adopted to ensure that macro\-$F_1$ scores are not distorted by class imbalance and that each stance category contributes equally to evaluation\. The natural class distribution in the full corpus is unknown and likely highly imbalanced; our study evaluates how well models distinguish between the three response types under a controlled, balanced setup\.

The use of GPT\-4o as a pre\-filtering tool could in principle bias the sample towards comments that are more recognisable to LLMs, inflating zero\-shot performance estimates\. We assess this using the seven zero\-shot classifiers evaluated in this study \(BART\-MNLI and six generative models; see [Sections 4](https://arxiv.org/html/2606.04274#S4) and [5](https://arxiv.org/html/2606.04274#S5)\): BART\-MNLI confidence was marginally higher for the labelled pool than for a stratified random sample of 900 unlabelled comments from the same corpus \(mean 0\.62 vs\. 0\.60; Mann\-Whitney $U$, $p = 0.004$\), and only 25\.2% of labelled comments achieved $\geq 6/7$ model agreement\. Both findings point to a negligible effect \(Cohen’s $d = 0.13$\); we conclude the pre\-filtering introduced no practically meaningful bias\. Full details are in [Appendix B](https://arxiv.org/html/2606.04274#A2)\.
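The bias check above pairs a significance test with an effect size, and the two answer different questions: with 900 comments per group, a small shift can reach significance while remaining negligible in size\. A minimal sketch of both statistics follows\. The pooled\-standard\-deviation form of Cohen’s $d$ and the tie handling in the $U$ statistic are assumptions, since the exact variants used are not stated here, and the function names are illustrative\.

```python
import math
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardised mean difference using the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """U statistic for sample `a`: pairs where a > b, ties counted as 0.5.

    Quadratic in sample size, which is acceptable for a few hundred
    items per group; the p-value computation is not included.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

Dividing $U$ by the number of pairs, `len(a) * len(b)`, gives the probability that a randomly drawn value from the first group exceeds one from the second, which is another way to read how small the shift is\.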

**Stance Taxonomy\.** Each comment was labelled with one of three stance categories:

- **Belief**: the comment supports, propagates, or treats as credible the misinformation claim\.
- **Fact\-check**: the comment corrects, challenges, or provides evidence against the claim\.
- **Other**: the comment is neutral, off\-topic, meta\-commentary, or unrelated to the claim\.

This three\-class schema aligns with established frameworks in stance and rumour analysis, including the SDQC taxonomy \(Support/Deny/Query/Comment\) of Zubiaga et al\. \[[43](https://arxiv.org/html/2606.04274#bib.bib43)\] as operationalised in the RumourEval shared tasks\[[9](https://arxiv.org/html/2606.04274#bib.bib9),[12](https://arxiv.org/html/2606.04274#bib.bib12)\]\. Our *belief* category corresponds to Support, *fact\-check* to Deny, and *other* merges Query and Comment into a single residual class, consistent with prior three\-class simplifications such as the For/Against/Observing taxonomy of Ferreira and Vlachos \[[11](https://arxiv.org/html/2606.04274#bib.bib11)\] and the Supported/Refuted/NotEnoughInfo schema of FEVER\[[35](https://arxiv.org/html/2606.04274#bib.bib35)\]\.

Because the three abstract class labels are insufficient to guide annotation of topic\-specific discourse, each category was operationalised with concrete thematic descriptions per topic \([Table 2](https://arxiv.org/html/2606.04274#S3.T2)\)\. These same descriptions later informed the design of topic\-specific zero\-shot label sets \([Section 4](https://arxiv.org/html/2606.04274#S4)\)\.

Table 2:Label descriptions for each topic\.TopicLabelDescriptionWater ScandalBeliefReactionary comments expressing public outrage, anti\-billionaire sentiment, or political anger; calls to action\.\(Environment\)Fact\-checkEvidence\-based refutations, scepticism citing credible sources, or corrections clarifying ownership\.OtherOff\-topic remarks or unrelated commentary\.COVID BleachBeliefAnger, ridicule, or reinforcement of the claim that Trump suggested drinking bleach; political humour\.\(Health\)Fact\-checkCorrections referencing the April 2020 briefing; criticism of media misrepresentation\.OtherOff\-topic or unrelated remarks\.Gang TattooBeliefAssertions that tattoos are genuine MS\-13 markings; defence of the post\.\(Immigration\)Fact\-checkIdentification of digital manipulation; references to expert assessments\.OtherOff\-topic or unrelated remarks\.Annotation Process and Inter\-Annotator Agreement\.Annotation was performed by two annotators; each of the 900 comments was evaluated as an independent text unit without access to surrounding thread context, reducing interpretive bias from conversational inference\. Inter\-annotator agreement was measured using Cohen’sκ\\kappa, yielding substantial overall agreement \(κ=0\.752\\kappa=0\.752, raw agreement 83\.4%;[Table˜3](https://arxiv.org/html/2606.04274#S3.T3)\)\. Agreement was highest for immigration \(κ=0\.800\\kappa=0\.800\) and lowest for environment \(κ=0\.710\\kappa=0\.710\), with all topics falling within the range of substantial agreement; the lowerκ\\kappafor environment reflects greater interpretive ambiguity in that topic rather than a systematic labelling failure\.

Table 3: Inter-annotator agreement overall and by topic (Cohen's $\kappa$ with 95% confidence intervals).

| Topic | $n$ | $\kappa$ | 95% CI | Agreement |
| --- | --- | --- | --- | --- |
| Overall | 900 | 0.752 | [0.714, 0.786] | 83.4% |
| Immigration | 300 | 0.800 | [0.742, 0.854] | 86.7% |
| Health | 300 | 0.745 | [0.686, 0.805] | 83.0% |
| Environment | 300 | 0.710 | [0.644, 0.775] | 80.7% |

Disagreements were predominantly driven by the *other* category ($\approx 78\%$ of all disagreements), where the boundary between on-topic and tangential content is often implicit. Direct confusion between *belief* and *fact-check* was rare, occurring mainly in sarcastic or ironic comments where the intended stance is obscured by non-literal language, a known challenge in stance annotation \[[43](https://arxiv.org/html/2606.04274#bib.bib43)\].
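For readers who want to reproduce the agreement statistic, the following is a minimal sketch of Cohen's $\kappa$ for two annotators, computed as observed agreement corrected for chance agreement. The function name and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical labels, not taken from the dataset).
annotator_1 = ["belief", "belief", "fact-check", "other"]
annotator_2 = ["belief", "other", "fact-check", "other"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Raw agreement in the toy example is 75%, but $\kappa$ is lower because some of that agreement would be expected by chance; the same relationship holds between the 83.4% raw agreement and $\kappa=0.752$ reported above.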

## 4 Methods

We classify each Reddit comment into one of three response categories defined in [Section 3.2](https://arxiv.org/html/2606.04274#S3.SS2): *belief* (propagates the misinformation), *fact-check* (corrects it), and *other* (neutral or unrelated). We compare three modelling paradigms on the same 900-comment gold set: (i) natural-language-inference (NLI) zero-shot classification with BART-MNLI ([Section 4.1](https://arxiv.org/html/2606.04274#S4.SS1.SSSx1)); (ii) generative zero-shot classification with three open-source Llama variants and three commercial LLMs ([Section 4.1](https://arxiv.org/html/2606.04274#S4.SS1.SSSx3)); and (iii) supervised fine-tuning of two encoder transformers ([Section 4.3](https://arxiv.org/html/2606.04274#S4.SS3)). Candidate-label schemas and the shared evaluation protocol are described in [Sections 4.2](https://arxiv.org/html/2606.04274#S4.SS2) and [4.4](https://arxiv.org/html/2606.04274#S4.SS4).

### 4.1 Zero-Shot Classifiers

All seven zero-shot classifiers receive the same comment text and the same candidate label set for a given configuration (UL or topic-specific; see [Section 4.2](https://arxiv.org/html/2606.04274#S4.SS2)); they differ in how the classification decision is formulated.

#### NLI baseline: BART\-MNLI

We use `facebook/bart-large-mnli`, a BART encoder–decoder fine-tuned on the Multi-Genre NLI corpus \[[40](https://arxiv.org/html/2606.04274#bib.bib40)\]. Following Yin et al. \[[41](https://arxiv.org/html/2606.04274#bib.bib41)\], zero-shot classification is reframed as textual entailment: each candidate label $\langle\ell\rangle$ is converted into a hypothesis of the form *“This text is about $\langle\ell\rangle$”*, and the model scores the entailment probability of each hypothesis given the comment as premise. The label with the highest entailment probability is returned as the prediction. This formulation produces a single probability vector over the candidate labels per comment and is fully deterministic. The MNLI training objective is well suited to short, opinionated text because the candidate hypotheses can be expressed as natural-language descriptions of stance categories rather than category indices.
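The entailment formulation can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' script: the helper names are hypothetical, the trailing full stop in the template is assumed, and the final function relies on the Hugging Face `transformers` zero-shot pipeline (imported lazily because it downloads the model on first use).

```python
TEMPLATE = "This text is about {}."  # wording from the paper; punctuation assumed


def build_hypotheses(labels, template=TEMPLATE):
    """Turn each candidate label into an NLI hypothesis string."""
    return [template.format(label) for label in labels]


def pick_label(labels, entailment_probs):
    """Return the label whose hypothesis has the highest entailment score."""
    best = max(range(len(labels)), key=lambda i: entailment_probs[i])
    return labels[best]


def classify_with_bart_mnli(comment, labels):
    """Score all hypotheses with BART-MNLI and return the top label."""
    from transformers import pipeline  # lazy import: heavy optional dependency

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    result = classifier(comment, candidate_labels=labels,
                        hypothesis_template=TEMPLATE)
    return result["labels"][0]  # the pipeline sorts labels by score


UL = ["spreads misinformation", "corrects misinformation",
      "provides neutral or unrelated commentary"]
print(build_hypotheses(UL)[0])
```

Because the only inputs are the comment and the label strings, changing the label wording changes the hypotheses directly, which is why the schema comparison in Section 4.2 is a meaningful intervention for this model.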

#### Open\-source generative LLMs: Llama family

We evaluate three Llama variants released by Meta: Llama-3.2-3B, Llama-3.1-8B, and Llama-3-70B (the latter in 4-bit quantised form). All three are served locally on a GPU server equipped with two NVIDIA A40 GPUs (45 GB VRAM each), via Ollama. Inference uses `temperature` = 0, no system prompt, and all other sampling options left at their defaults; this yields deterministic outputs. A single prompt template is used across all three models, all topics, and both label schemas:

> You are a careful annotator\. Choose exactly ONE label for the comment\. Candidate labels: \{label\_block\} Comment: \{text\} Return ONLY the exact label text\.

`{label_block}` is the bullet list of candidate labels for the active schema ([Table 4](https://arxiv.org/html/2606.04274#S4.T4)); `{text}` is the comment. Free-text outputs are mapped to the three core categories using exact match followed by substring match against the candidate strings; unmappable outputs (under 0.5% of calls) are retried with the same prompt.
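The prompt construction and the two-stage output mapping described above could look roughly like this. It is a sketch only: the line breaks inside the template, the case-insensitive comparison, and the rule that an ambiguous substring match counts as unmappable are assumptions that the text does not specify.

```python
PROMPT_TEMPLATE = (
    "You are a careful annotator. Choose exactly ONE label for the comment.\n"
    "Candidate labels:\n{label_block}\n"
    "Comment: {text}\n"
    "Return ONLY the exact label text."
)


def build_prompt(text, labels):
    """Fill the shared template with a bullet list of candidate labels."""
    label_block = "\n".join(f"- {label}" for label in labels)
    return PROMPT_TEMPLATE.format(label_block=label_block, text=text)


def map_output(raw, labels):
    """Map free-text model output to a candidate label, or None if unmappable."""
    cleaned = raw.strip().lower()
    # Stage 1: exact match against the candidate strings.
    for label in labels:
        if cleaned == label.lower():
            return label
    # Stage 2: substring match; accept only an unambiguous hit (assumption).
    hits = [label for label in labels if label.lower() in cleaned]
    return hits[0] if len(hits) == 1 else None


UL = ["spreads misinformation", "corrects misinformation",
      "provides neutral or unrelated commentary"]
print(map_output("Label: spreads misinformation.", UL))
```

A return value of `None` corresponds to the unmappable case, which the pipeline described above handles by retrying the same prompt.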

#### Commercial generative LLMs

We evaluate three commercial LLMs: Anthropic's Claude Haiku 4.5 (`claude-haiku-4-5-20251001`) and Claude Sonnet 4.6 (`claude-sonnet-4-6`; with `claude-sonnet-4-20250514` as a fallback on policy refusals), and Google's Gemini Flash Lite 2.5 (`gemini-2.5-flash-lite`). The Anthropic and Google LLMs were accessed via vendor command-line clients between 2 and 9 April 2026. All three models were queried via their respective vendor CLIs with an empty tool configuration. The CLIs do not expose temperature or other sampling parameters; outputs were effectively deterministic, as repeated calls on the same comment produced identical labels. The user prompt is otherwise identical to the Llama template above (UL) or its topic-specific variant (“Output ONLY the exact label text.”), so the only intentional across-model difference is the underlying model. Claude Sonnet was the only model to occasionally trigger usage-policy refusals (5 / 900 calls); these were retried with the dated fallback checkpoint `claude-sonnet-4-20250514`, which returned a valid label in every case. Calls were issued with five concurrent workers and a 60-s (Claude) or 180-s (Gemini) per-call timeout; failed or unmappable responses were retried up to three times. Using this strategy, all data points were successfully labelled by all models.

### 4.2 Candidate Label Schemas

To isolate the effect of label design from model capacity, all seven zero-shot classifiers are run under two label schemas while keeping comments, gold annotations, and model weights fixed ([Table 4](https://arxiv.org/html/2606.04274#S4.T4)). The *universal label* schema (UL) applies a single set of three generic natural-language hypotheses (*spreads misinformation*, *corrects misinformation*, and *provides neutral or unrelated commentary*) across all topics, mapping directly to the three gold categories. The *topic-specific* schemas (E1, H1, I1) replace these with six fine-grained sub-labels per topic, derived by qualitative inspection of the annotated comments to identify the dominant discourse patterns within each gold category. For example, the UL label *spreads misinformation* (belief) resolves into *outrage* and *endorse* in the environment topic (E1), reflecting the two dominant forms of belief propagation in the water-ownership discourse: emotionally charged amplification and explicit endorsement of the false claim. In the health topic (H1), the same category maps to *insistence*, capturing the pattern of repeatedly asserting the bleach-cure narrative; in the immigration topic (I1), it maps to *criminal*, reflecting comments that affirm the MS-13 gang-tattoo framing. Similarly, the residual *other* category decomposes into *scepticism*, *meta*, and *off-topic* across all three topics, with I1 adding *photoshop* to capture the large share of comments discussing the image-manipulation evidence. Topic-specific predictions are mapped back to the three gold categories through a fixed many-to-three mapping, so performance is computed identically across schemas. We show the full mapping in [Table 9](https://arxiv.org/html/2606.04274#A1.T9) ([Appendix A](https://arxiv.org/html/2606.04274#A1)).
The comparative performance of UL and topic-specific schemas across all seven zero-shot classifiers is reported in [Section 5.1](https://arxiv.org/html/2606.04274#S5.SS1).
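The many-to-three mapping can be illustrated for the environment schema (E1) using only the assignments stated in the paragraph above. The dictionary below is a hypothetical sketch: the *fact-check* entry is assumed from the label name, and the authoritative mapping is the one in Table 9 of the appendix.

```python
GOLD_CLASSES = {"belief", "fact-check", "other"}

# E1 sub-label -> gold category. Sources for each entry are noted inline.
E1_TO_GOLD = {
    "outrage": "belief",         # stated in the text
    "endorse": "belief",         # stated in the text
    "fact-check": "fact-check",  # assumption based on the label name
    "scepticism": "other",       # stated in the text
    "meta": "other",             # stated in the text
    "off-topic": "other",        # stated in the text
}


def to_gold(sub_label, mapping=E1_TO_GOLD):
    """Collapse a topic-specific prediction to one of the three gold classes."""
    return mapping[sub_label]


predictions = ["outrage", "meta", "fact-check", "endorse"]
print([to_gold(p) for p in predictions])
```

Scoring then proceeds on the collapsed labels, so UL and topic-specific runs are evaluated against the same three-class gold standard.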

Table 4: Candidate label sets used by all zero-shot classifiers. UL is applied across topics; E1, H1, I1 are applied within their respective topic. Topic-specific labels are mapped back to the three gold categories before scoring.

| Set | Candidate labels | Scope |
| --- | --- | --- |
| UL | spreads misinformation; corrects misinformation; provides neutral or unrelated commentary | All topics |
| E1 | outrage; endorse; fact-check; scepticism; meta; off-topic | Environment |
| H1 | disbelief; insistence; clarification; scepticism; meta; off-topic | Health |
| I1 | criminal; insistence; clarification; photoshop; meta; off-topic | Immigration |

### 4.3 Fine-Tuned Transformer Classifiers

Two encoder transformers are fine-tuned on the same 900-comment dataset: `distilbert-base-uncased` \[[30](https://arxiv.org/html/2606.04274#bib.bib30)\], a knowledge-distilled variant of BERT that retains most of BERT's downstream performance with roughly 40% fewer parameters, and `roberta-base` \[[23](https://arxiv.org/html/2606.04274#bib.bib23)\], a case-sensitive masked-language-model encoder pretrained on a larger and more diverse corpus than BERT under a refined optimisation regime (longer training, dynamic masking, no next-sentence-prediction objective). Each model uses its native tokeniser (WordPiece for DistilBERT, byte-pair encoding for RoBERTa); inputs are truncated to a maximum length of 256 sub-word tokens.

Both models are fine-tuned with the same hyperparameter configuration, chosen following common practice in the literature: AdamW, learning rate $2\times10^{-5}$, weight decay $0.01$, batch size 8, up to 10 epochs, with early stopping (patience 2 epochs) on validation macro-$F_1$ over an internal 15% validation split drawn from the training fold; the checkpoint with the highest validation macro-$F_1$ is restored at the end of training. Training is performed independently for each of the five cross-validation folds ([Section 4.4](https://arxiv.org/html/2606.04274#S4.SS4)) and for each of the three topics in the per-topic setting, using a fixed seed for reproducibility.
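To make the stopping rule concrete, the sketch below encodes the stated hyperparameters in a small config object and replays early stopping over a list of per-epoch validation macro-$F_1$ scores. The class and function names are hypothetical, the scores are invented for illustration, and the actual training loop (optimiser steps, checkpoint I/O) is deliberately omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as stated in the text; field names are illustrative."""
    learning_rate: float = 2e-5
    weight_decay: float = 0.01
    batch_size: int = 8
    max_epochs: int = 10
    patience: int = 2
    max_length: int = 256
    val_fraction: float = 0.15


def select_checkpoint(val_scores, cfg=FineTuneConfig()):
    """Return (best_epoch, last_epoch_run), both zero-indexed.

    Training stops once the validation score has failed to improve for
    `cfg.patience` consecutive epochs; the best epoch is the one restored.
    """
    best_epoch, best_score, stale, last_epoch = -1, float("-inf"), 0, -1
    for epoch, score in enumerate(val_scores[:cfg.max_epochs]):
        last_epoch = epoch
        if score > best_score:
            best_epoch, best_score, stale = epoch, score, 0
        else:
            stale += 1
            if stale >= cfg.patience:
                break
    return best_epoch, last_epoch


# Invented validation curve: improvement stalls after the second epoch.
print(select_checkpoint([0.40, 0.55, 0.53, 0.54, 0.60]))
```

In the invented curve the run stops before reaching the later, higher score, which shows one way in which a short patience can contribute to fold-to-fold variance.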

### 4.4 Evaluation Protocol

All nine models are evaluated under a single shared protocol so that zero\-shot and fine\-tuned numbers are directly comparable\. We use stratified 5\-fold cross\-validation, stratified per topic so that each fold contains 60 test instances per topic \(20 per gold class\), 240 training instances per topic per fold for the fine\-tuned models, and 180 test instances overall\. The same fold assignments are reused for every model, including zero\-shot classifiers, so each fold’s test partition is identical across models and per\-fold scores are paired across models for significance testing\.
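The fold sizes quoted above follow from stratifying on topic and gold class jointly. The sketch below is one way to produce such an assignment in pure Python; the function name and the seed are assumptions, and the authors' actual splitting code may differ (for instance, by using a library splitter).

```python
import random
from collections import defaultdict


def assign_folds(items, n_folds=5, seed=0):
    """Assign a fold index to each (topic, gold_label) item.

    Items are grouped by (topic, label), shuffled within each group, and
    dealt round-robin, so every fold gets an equal share of every group.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(items):
        groups[key].append(idx)
    rng = random.Random(seed)
    folds = [0] * len(items)
    for key in sorted(groups):  # sorted for a reproducible iteration order
        members = groups[key]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[idx] = position % n_folds
    return folds


# Synthetic index matching the dataset's shape: 3 topics x 3 classes x 100.
topics = ["environment", "health", "immigration"]
classes = ["belief", "fact-check", "other"]
items = [(t, c) for t in topics for c in classes for _ in range(100)]
folds = assign_folds(items)
test_fold_0 = [item for item, f in zip(items, folds) if f == 0]
print(len(test_fold_0))
```

Reusing the returned `folds` list for every model is what makes per-fold scores paired across models.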

**Metric.** We report macro-$F_1$ over the three gold classes (belief, fact-check, other), together with per-class $F_1$ scores. Macro-$F_1$ weighs the three classes equally, which is appropriate for the balanced dataset (100 instances per class per topic). We report two versions: *pooled macro-$F_1$* is computed over the full set of 180 test instances per fold (used in [Table 5](https://arxiv.org/html/2606.04274#S5.T5)), while *topic-averaged macro-$F_1$* is the mean of per-topic macro-$F_1$ values within a fold (used in [Table 7](https://arxiv.org/html/2606.04274#S5.T7)). The pooled convention summarises overall performance across all three topics; the topic-averaged convention exposes per-topic variability and is used wherever topic-level comparisons are the focus. Topic-specific label formulations (E1, H1, I1) are only applicable in the zero-shot setting, where label semantics are explicitly encoded in the inference prompt; fine-tuned models operate over fixed label indices and do not incorporate label descriptions.
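The two conventions can give different numbers on the same predictions, which is why the macro-$F_1$ columns of Tables 5 and 7 need not match. A minimal sketch, with hypothetical function names and a toy example rather than the study's data:

```python
CLASSES = ("belief", "fact-check", "other")


def f1_for_class(gold, pred, cls):
    """Per-class F1 = 2TP / (2TP + FP + FN); 0.0 if the class never occurs."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred):
    """Unweighted mean of the three per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in CLASSES) / len(CLASSES)


def topic_averaged_macro_f1(gold, pred, topics):
    """Mean of per-topic macro-F1 values (the Table 7 convention)."""
    by_topic = {}
    for g, p, t in zip(gold, pred, topics):
        golds, preds = by_topic.setdefault(t, ([], []))
        golds.append(g)
        preds.append(p)
    scores = [macro_f1(g, p) for g, p in by_topic.values()]
    return sum(scores) / len(scores)


# Toy fold: perfect on one topic, one belief comment missed on the other.
gold = ["belief", "fact-check", "other"] * 2
topics = ["environment"] * 3 + ["health"] * 3
pred = ["belief", "fact-check", "other", "other", "fact-check", "other"]
print(round(macro_f1(gold, pred), 3))                         # pooled
print(round(topic_averaged_macro_f1(gold, pred, topics), 3))  # topic-averaged
```

Pooling lets the correct belief prediction in one topic partly offset the miss in the other, whereas the topic-averaged score carries the full penalty of a zero belief $F_1$ within the weaker topic.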

**Variance reporting.** For zero-shot models, predictions are deterministic at the comment level; reported standard deviations therefore reflect only the composition of each fold's test partition. For fine-tuned models, training is repeated independently per fold and reported variance captures both training stochasticity and partition composition.

**Significance.** Pairwise comparisons in the headline findings are accompanied by paired significance tests across the same five folds (Wilcoxon signed-rank for within-paradigm comparisons) and by a permutation test on the full 900-comment prediction set (Holm–Bonferroni-corrected across the headline-claim family). The full test design and all reported $p$-values are given in [Appendix C](https://arxiv.org/html/2606.04274#A3); in the main text we annotate each comparative claim with the corresponding $p_{\text{Holm}}$. Across the seventeen headline pairwise comparisons, seven reach formal significance ($p_{\text{Holm}} < 0.01$); the remainder show directional consistency ([Table 11](https://arxiv.org/html/2606.04274#A3.T11)).
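For reference, the Holm–Bonferroni step-down adjustment that produces $p_{\text{Holm}}$ from a family of raw $p$-values can be sketched as below. The function name and the three example $p$-values are illustrative assumptions; the paper's family contains seventeen comparisons and its raw values are in Appendix C.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order.

    The i-th smallest raw p-value (i = 0, 1, ...) is multiplied by (m - i);
    adjusted values are capped at 1 and forced to be non-decreasing.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for a family of three comparisons.
raw = [0.01, 0.04, 0.03]
print([round(p, 3) for p in holm_bonferroni(raw)])
```

A comparison is then declared significant when its adjusted value falls below the chosen threshold, which is why some directionally consistent differences do not reach formal significance after correction.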

**Compute and reproducibility.** BART-MNLI and the three Llama variants are run on a single NVIDIA A100 80GB GPU; DistilBERT and RoBERTa are fine-tuned on the same machine. Commercial-model calls are issued over the public APIs of the respective vendors (see [Section 4.1](https://arxiv.org/html/2606.04274#S4.SS1.SSSx3)). We will release the codebook, the labelled dataset and relevant scripts for data processing, model inference, and evaluation upon paper acceptance.

## 5 Results

This section presents the experimental results for both zero-shot and fine-tuned misinformation response classification across the three misinformation topics. We organise the analysis around four findings. We first examine zero-shot classification, asking how label formulation interacts with topic discourse ([Section 5.1](https://arxiv.org/html/2606.04274#S5.SS1)) and whether model scale drives classification performance ([Section 5.2](https://arxiv.org/html/2606.04274#S5.SS2)). We then turn to deployment considerations, identifying BART-MNLI as a surprisingly strong and efficient zero-shot baseline ([Section 5.3](https://arxiv.org/html/2606.04274#S5.SS3)). Finally, we show that supervised fine-tuning outperforms every zero-shot configuration we tested, and that its advantage is concentrated precisely where zero-shot models struggle most: the *belief* class ([Section 5.4](https://arxiv.org/html/2606.04274#S5.SS4)).

### 5.1 Label schema matters, but unevenly across topics

Table 5: Class-wise and macro-$F_1$ scores under the Universal Label (UL) schema. All models report mean $\pm$ standard deviation across the same stratified 5-fold cross-validation splits. For fine-tuned models, variance reflects both training sensitivity and dataset composition. For zero-shot models, predictions are fixed and deterministic; variance reflects dataset composition differences across folds only, and is consequently very low (macro-$F_1$ standard deviation $\leq 0.04$ across all zero-shot models).

| Setting | Model | Belief | Fact-check | Other | Macro-$F_1$ |
| --- | --- | --- | --- | --- | --- |
| Zero-shot | BART-Large-MNLI | 0.36 ± 0.02 | 0.17 ± 0.03 | 0.45 ± 0.02 | 0.33 ± 0.01 |
| | Llama-3.2-3B | 0.47 ± 0.01 | 0.43 ± 0.08 | 0.19 ± 0.06 | 0.36 ± 0.02 |
| | Llama-3-8B | 0.32 ± 0.06 | 0.45 ± 0.07 | 0.50 ± 0.04 | 0.42 ± 0.04 |
| | Llama-3-70B | 0.24 ± 0.08 | 0.60 ± 0.05 | 0.55 ± 0.02 | 0.47 ± 0.02 |
| | Claude Haiku 4.5 | 0.34 ± 0.04 | 0.60 ± 0.04 | 0.56 ± 0.02 | 0.50 ± 0.03 |
| | Gemini Flash Lite 2.5 | 0.32 ± 0.07 | 0.59 ± 0.06 | 0.53 ± 0.03 | 0.48 ± 0.04 |
| | Claude Sonnet 4.6 | 0.17 ± 0.08 | 0.54 ± 0.06 | 0.55 ± 0.02 | 0.42 ± 0.02 |
| Fine-tuned | DistilBERT | 0.52 ± 0.11 | 0.67 ± 0.06 | 0.58 ± 0.03 | 0.59 ± 0.03 |
| | RoBERTa | 0.52 ± 0.12 | 0.71 ± 0.07 | 0.64 ± 0.05 | 0.62 ± 0.07 |

Table 6: Zero-shot classification $F_1$ scores under topic-specific label sets (E1, H1, I1). Bold indicates the best value per column within each topic.

| Topic | Model | Belief | Fact-check | Other | Macro-$F_1$ |
| --- | --- | --- | --- | --- | --- |
| Environment | BART-Large-MNLI | 0.62 | 0.56 | 0.44 | 0.54 |
| | Llama-3.2-3B | 0.53 | 0.45 | 0.19 | 0.39 |
| | Llama-3-8B | 0.66 | 0.68 | 0.24 | 0.53 |
| | Llama-3-70B | 0.09 | 0.68 | 0.51 | 0.43 |
| | Claude Haiku 4.5 | **0.73** | 0.68 | 0.34 | 0.58 |
| | Gemini Flash Lite 2.5 | 0.70 | **0.71** | 0.36 | 0.59 |
| | Claude Sonnet 4.6 | 0.72 | **0.71** | **0.52** | **0.65** |
| Health | BART-Large-MNLI | 0.38 | 0.38 | 0.26 | 0.34 |
| | Llama-3.2-3B | 0.48 | 0.49 | 0.19 | 0.39 |
| | Llama-3-8B | 0.51 | 0.39 | 0.08 | 0.33 |
| | Llama-3-70B | 0.19 | **0.65** | **0.56** | 0.47 |
| | Claude Haiku 4.5 | 0.48 | 0.49 | 0.23 | 0.40 |
| | Gemini Flash Lite 2.5 | **0.52** | 0.46 | 0.32 | 0.43 |
| | Claude Sonnet 4.6 | 0.46 | 0.62 | 0.49 | **0.52** |
| Immigration | BART-Large-MNLI | 0.22 | 0.31 | 0.27 | 0.27 |
| | Llama-3.2-3B | 0.41 | 0.33 | 0.20 | 0.31 |
| | Llama-3-8B | 0.26 | 0.38 | 0.39 | 0.34 |
| | Llama-3-70B | 0.39 | 0.44 | **0.54** | 0.46 |
| | Claude Haiku 4.5 | 0.44 | 0.58 | 0.37 | 0.46 |
| | Gemini Flash Lite 2.5 | **0.48** | 0.54 | 0.48 | 0.50 |
| | Claude Sonnet 4.6 | 0.47 | **0.61** | 0.53 | **0.54** |

We examine whether zero-shot classification is sensitive to label wording, comparing UL and topic-specific schemas (E1/H1/I1) while keeping gold annotations and model weights fixed (see [Section 4.2](https://arxiv.org/html/2606.04274#S4.SS2)).

**Topic-specific labels help most in Environment.** Topic-specific labels substantially outperform universal labels in the Environment topic: BART improves from 0.31 to 0.54 ($\Delta=0.23$; $p_{\text{Holm}} < 0.01$; [Table 11](https://arxiv.org/html/2606.04274#A3.T11)(f)), but gains are negligible in Health ($\Delta=+0.032$) and Immigration ($\Delta=+0.002$; [Table 11](https://arxiv.org/html/2606.04274#A3.T11)(f)). This asymmetry reflects discourse structure: the environment topic contains claim-specific terminology that aligns well with richer label descriptions, while health and immigration discourse is already adequately captured by generic labels. Gains are also class-uneven: Llama-3-70B achieves fact-check $F_1=0.68$ under E1, but belief collapses to 0.09, indicating that topic-specific labels may amplify an existing bias toward the already-dominant class rather than lifting performance across the board.

**Proprietary models have distinct class profiles.** Among proprietary models, all three outperform every open-source zero-shot model in macro-$F_1$ in at least one topic, but with distinct profiles. Claude Haiku 4.5 achieves high belief $F_1$ (0.73/0.48/0.44 across topics) but is weak on *other* ($F_1$ 0.23–0.37). Gemini Flash Lite 2.5 shows the complementary pattern: better *other* coverage and the highest belief in Health (0.52). Claude Sonnet 4.6 achieves the highest macro-$F_1$ in all three topics (0.65/0.52/0.54) with the most balanced class profile, and benefits most from topic-specific labels: the E1 gain in Environment ($\Delta=+0.28$; $p_{\text{Holm}} < 0.01$) and the I1 gain in Immigration ($\Delta=+0.13$; $p_{\text{Holm}}=0.008$) are both significant ([Table 11](https://arxiv.org/html/2606.04274#A3.T11)(f)), whereas the H1 gain in Health is not ($\Delta=+0.06$).

**Topic is a first-order factor.** Topic is itself a first-order factor in classification difficulty. Under topic-specific labels ([Table 6](https://arxiv.org/html/2606.04274#S5.T6)), the same model can vary by more than 0.13 macro-$F_1$ across topics: Sonnet ranges from 0.52 (Health) to 0.65 (Environment), Haiku from 0.40 (Health) to 0.58 (Environment), and Gemini from 0.43 (Health) to 0.59 (Environment). These spreads are comparable in magnitude to differences between adjacent models within a topic, indicating that classification difficulty is shaped as much by topic discourse as by model capacity. Health is the hardest topic for nearly every model, while Environment is the easiest, a pattern that persists under both label schemas.

Overall, label schema and topic jointly shape zero\-shot classification, but gains from richer labels are conditional on semantic alignment with the discourse of each topic\. This dependency motivates a closer examination of model behaviour under a fixed label schema, which we analyse next\.

### 5.2 Bigger LLMs do not consistently improve classification

Table 7: Per-topic macro-$F_1$ under the Universal Label (UL) schema. All models report mean $\pm$ standard deviation across the same stratified 5-fold cross-validation splits. For zero-shot models, variance reflects dataset composition differences only (predictions are fixed); for fine-tuned models, variance additionally captures training sensitivity. Bold indicates the best value per column within each setting.

| Setting | Model | Environment | Health | Immigration | Macro-$F_1$ |
| --- | --- | --- | --- | --- | --- |
| Zero-shot | BART-Large-MNLI | 0.31 ± 0.04 | 0.28 ± 0.06 | 0.27 ± 0.03 | 0.29 ± 0.01 |
| | Llama-3.2-3B | 0.41 ± 0.04 | 0.39 ± 0.04 | 0.30 ± 0.04 | 0.36 ± 0.02 |
| | Llama-3-8B | 0.42 ± 0.07 | 0.46 ± 0.05 | 0.38 ± 0.07 | 0.42 ± 0.04 |
| | Llama-3-70B | 0.45 ± 0.06 | 0.49 ± 0.03 | 0.45 ± 0.05 | 0.46 ± 0.02 |
| | Claude Haiku 4.5 | **0.54 ± 0.02** | 0.48 ± 0.04 | **0.48 ± 0.08** | **0.50 ± 0.03** |
| | Gemini Flash Lite 2.5 | 0.45 ± 0.05 | **0.53 ± 0.05** | 0.43 ± 0.07 | 0.48 ± 0.01 |
| | Claude Sonnet 4.6 | 0.38 ± 0.10 | 0.46 ± 0.05 | 0.41 ± 0.05 | 0.42 ± 0.03 |
| Fine-tuned | DistilBERT | 0.60 ± 0.06 | 0.54 ± 0.10 | **0.61 ± 0.04** | 0.59 ± 0.03 |
| | RoBERTa | **0.69 ± 0.07** | **0.62 ± 0.10** | 0.56 ± 0.05 | **0.62 ± 0.06** |

To isolate the effect of model scale, we compare all zero-shot models under a shared universal label schema. [Table 7](https://arxiv.org/html/2606.04274#S5.T7) summarises macro performance across topics, while [Table 5](https://arxiv.org/html/2606.04274#S5.T5) provides a class-level breakdown.

**Llama scaling is inconsistent.** Within the Llama family, scaling from 8B to 70B yields inconsistent gains: the 70B model achieves slightly higher overall macro-$F_1$ (0.46 vs. 0.42; $\Delta=-0.044$; [Table 11](https://arxiv.org/html/2606.04274#A3.T11)(a)), but the two models are statistically indistinguishable in the Environment topic under UL (0.42 vs. 0.45; $\Delta=-0.024$, $p_{\text{Holm}}=1.000$; [Table 11](https://arxiv.org/html/2606.04274#A3.T11)(a)), and Llama-3-8B in fact overtakes Llama-3-70B in this topic once topic-specific labels are introduced (0.53 vs. 0.43; [Table 6](https://arxiv.org/html/2606.04274#S5.T6)). The inconsistency is explained by uneven class behaviour: as shown in [Table 5](https://arxiv.org/html/2606.04274#S5.T5), larger models tend to perform well on the *fact-check* category (e.g., 0.60 for Llama-3-70B) but perform substantially worse on *belief* (0.24). In contrast, smaller models such as Llama-3.2-3B exhibit stronger belief detection (0.47) but at the cost of collapsing on *other* (0.19).

**Sonnet underperforms Haiku.** The performance of proprietary models in [Table 5](https://arxiv.org/html/2606.04274#S5.T5) sharpens this picture. Claude Haiku 4.5 achieves the highest zero-shot macro-$F_1$ under UL (0.50), outperforming the larger Llama-3-70B (0.47). More strikingly, Claude Sonnet 4.6, considered a more capable model than Haiku, achieves only 0.42 macro-$F_1$ under UL, the lowest among the proprietary models ($\Delta=-0.082$ versus Haiku; $p_{\text{Holm}}=0.002$; [Table 11](https://arxiv.org/html/2606.04274#A3.T11)(c)). This inversion is explained by class behaviour: Sonnet collapses belief detection to 0.17, the lowest of any model, suggesting that the abstract UL phrase “spreads misinformation” is insufficient for Sonnet to reliably distinguish implicit belief from neutral commentary.

**Label–model alignment, not scale.** Taken together, class-level prediction patterns are better explained by label–model alignment than by scale. Models whose instruction-tuning or safety training is stronger tend to favour the class most consistent with that training when the label wording is ambiguous. This effect is most sharply visible in Claude Sonnet 4.6: under UL, Sonnet not only collapsed belief detection (0.17, the lowest of any model) but also produced outright refusals on a subset of comments flagged as sensitive content, requiring fallback to an earlier model version for those instances. These refusals are consistent with safety alignment that makes the model reluctant to assert that a comment “spreads misinformation” without strong contextual evidence, which is precisely the inference the *belief* class demands. Topic-specific labels partially correct this by providing concrete, claim-anchored descriptions that reduce label ambiguity and give the model a more tractable basis for classification.

These results show that model scale alone is not a reliable driver of zero\-shot performance in misinformation response classification; training paradigm and the interaction between instruction\-tuning and label wording are more decisive than parameter count\.

### 5.3 BART-MNLI: a surprisingly strong zero-shot baseline

Table 8: Computational cost of classification models. For fine-tuned models, values are averaged across cross-validation folds per topic. The Llama-3-70B model was executed in quantised form. API costs cover both UL and topic-specific schema runs including retried calls; see [Section 5.3](https://arxiv.org/html/2606.04274#S5.SS3) for methodology.

| Model | Peak GPU Mem. (GB) | Avg Infer. Time (s) | API Cost (USD) |
| --- | --- | --- | --- |
| BART-MNLI | 4.45 | 0.05 | $0.00 |
| DistilBERT | 0.83 | 0.004 | $0.00 |
| RoBERTa | 1.48 | 0.007 | $0.00 |
| Llama-3.2-3B | 7.96 | 0.21 | $0.00 |
| Llama-3-8B | 20.05 | 0.20 | $0.00 |
| Llama-3-70B | 4.79 (quantised†) | 42.42 | $0.00 |
| Claude Haiku 4.5‡ | N/A | 10.48 | $1.19 |
| Gemini Flash Lite 2.5‡ | N/A | 11.82 | $0.16 |
| Claude Sonnet 4.6‡ | N/A | 8.96 | $2.80 |

† The low reported GPU memory for Llama-3-70B reflects CPU–GPU offloading; the full quantised model requires substantially more memory when GPU-resident.

‡ Commercial models are accessed via cloud APIs; GPU memory is not applicable (N/A). Inference times are wall-clock API response times (including network round-trip latency) and are not directly comparable to the GPU inference times of local models.

**Balanced predictions across classes.** Although larger generative models can achieve competitive performance in certain settings, BART-MNLI provides a more stable and practically useful zero-shot baseline. As shown in [Tables 5](https://arxiv.org/html/2606.04274#S5.T5) and [6](https://arxiv.org/html/2606.04274#S5.T6), BART maintains relatively balanced performance across response categories, in contrast to larger generative models (both open-source Llama variants and commercial LLMs), which tend to favour fact-check over belief; the smallest Llama variant (3.2-3B) is an exception that instead collapses *other*.

**NLI architecture explains the symmetry.** This difference is consistent with the underlying model architecture. BART-MNLI operates within a natural language inference (NLI) framework, treating each label as a hypothesis to be evaluated against the input text. This formulation encourages more symmetric decision boundaries across classes, particularly for distinguishing subjective belief from objective claims. In contrast, generative models rely on prompt-based completion, which can lead to implicit biases toward dominant or easily identifiable categories.

**Cost–deployment profiles.** From a practical perspective, the three model classes in [Table 8](https://arxiv.org/html/2606.04274#S5.T8) present distinct cost–deployment profiles. BART-MNLI is the most resource-efficient zero-shot option: 0.05 s per inference, 4.45 GB GPU memory, and zero marginal monetary cost once deployed. Open-source Llama models likewise incur no per-query cost and preserve data privacy by running locally, both advantages for large-scale or sensitive deployments, but latency rises sharply with model size (up to 42.42 s per inference for Llama-3-70B), which limits throughput. Commercial API models require no local infrastructure and are straightforward to deploy; their costs were estimated from actual call logs covering both UL and topic-specific runs, including retried calls from rate limits and label normalisation failures: Claude Haiku 4.5 required 2,874 total calls (1,190 UL + 1,684 TS), Sonnet 4.6 required 1,930 (903 + 1,027), and Gemini Flash Lite 2.5 required 3,052 (1,222 + 1,830), all above the 1,800-call baseline. Applying published per-token rates ($0.80/$4.00 per MTok input/output for Haiku; $3.00/$15.00 for Sonnet; $0.10/$0.40 for Gemini), the total cost was $1.19, $2.80, and $0.16 respectively. These totals are small on our 900-comment corpus, but they extrapolate to approximately $1.32, $3.11, and $0.18 per 1,000 additional comments; at the scale of social-media monitoring (millions of posts), these per-query costs accumulate substantially.
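The extrapolation above is a linear scaling of the logged totals, which can be checked with a few lines of arithmetic. The variable and function names below are illustrative; the call counts and dollar totals are the ones reported in the text, and the sketch assumes cost scales linearly with the number of comments (both schemas and the observed retry rate included).

```python
CORPUS_SIZE = 900   # comments in the gold set
SCHEMA_RUNS = 2     # universal labels plus topic-specific labels
BASELINE_CALLS = CORPUS_SIZE * SCHEMA_RUNS  # calls needed with no retries

# Totals reported in the text (calls include retries).
LOGGED = {
    "Claude Haiku 4.5": {"calls": 2874, "cost_usd": 1.19},
    "Claude Sonnet 4.6": {"calls": 1930, "cost_usd": 2.80},
    "Gemini Flash Lite 2.5": {"calls": 3052, "cost_usd": 0.16},
}


def cost_per_thousand_comments(cost_usd, corpus_size=CORPUS_SIZE):
    """Linear extrapolation of the logged total to 1,000 comments."""
    return cost_usd / corpus_size * 1000


def retry_overhead(calls, baseline=BASELINE_CALLS):
    """Ratio of issued calls to the retry-free baseline."""
    return calls / baseline


for model, row in LOGGED.items():
    per_k = cost_per_thousand_comments(row["cost_usd"])
    overhead = retry_overhead(row["calls"])
    print(f"{model}: ${per_k:.2f} per 1,000 comments, "
          f"{overhead:.2f}x baseline calls")
```

Multiplying the per-1,000 figures by 1,000 gives a rough per-million estimate, which is where the accumulation noted above becomes material for the two Claude models.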

**Three-way trade-off.** Model selection thus involves a three-way trade-off between accuracy, computational cost, and deployment flexibility. BART-MNLI offers the best combination of balanced class performance, negligible inference cost, and zero monetary overhead, making it the strongest practical *zero-shot* baseline for misinformation response classification at scale. Local open-source models are a cost-free and privacy-preserving alternative suited to large-scale deployments, at the price of higher latency. Commercial models are well-suited to rapid prototyping or small annotated corpora, but their per-query pricing makes them impractical as a primary classifier for large-scale or repeated use. Whether any of these zero-shot options is sufficient on its own, however, depends on a comparison with supervised fine-tuning, which we turn to next.

### 5.4 Fine-tuning beats zero-shot, especially on the belief class

The preceding subsections show that even the best zero-shot configurations (larger Llama models, proprietary LLMs with topic-specific labels) leave substantial room for improvement, particularly on the *belief* class. We now show that supervised fine-tuning closes this gap, and that its advantage is concentrated precisely where zero-shot models struggle most.

**Overall advantage.** Under the universal label schema, both fine-tuned models outperform every zero-shot model on overall macro-$F_1$ ([Tables 5](https://arxiv.org/html/2606.04274#S5.T5) and [7](https://arxiv.org/html/2606.04274#S5.T7)). RoBERTa reaches 0.62 macro-$F_1$, exceeding the best zero-shot model (Claude Haiku 4.5, 0.50) by $\Delta=+0.12$, a statistically significant gap ($p_{\text{Holm}}=0.002$; [Table 11](https://arxiv.org/html/2606.04274#A3.T11)(e)). DistilBERT, despite using only 67M parameters, also outperforms every zero-shot model (0.59 macro-$F_1$, $\Delta=+0.09$ over Haiku). The advantage is consistent across topics: RoBERTa achieves the best fine-tuned performance in Environment (0.69) and Health (0.62), while DistilBERT leads in Immigration (0.61), and both models outperform every zero-shot model in every topic, with one near-tie: DistilBERT's 0.54 in Health is within fold variance of Gemini's 0.53.

**The advantage is largest on belief.** The clearest signal is in class-level performance ([Table 5](https://arxiv.org/html/2606.04274#S5.T5)). Zero-shot belief $F_1$ under UL ranges from 0.17 (Sonnet) to 0.47 (Llama-3.2-3B), but the latter trades belief detection for a collapsed *other* class (0.19). The proprietary models that achieve balanced overall performance (Haiku, Gemini, Sonnet) detect belief at only 0.34, 0.32, and 0.17 respectively. Both fine-tuned models reach belief $F_1$ of 0.52 without sacrificing performance on the other classes, lifting belief detection by $\Delta=+0.18$ over the best balanced zero-shot model (Haiku) and by $\Delta=+0.35$ over Sonnet. Fact-check and *other* performance also improve under fine-tuning (RoBERTa reaches the best values in both: 0.71 and 0.64), but the absolute gain on belief is the largest.

**A residual belief gap remains.** Even with supervision, belief detection lags fact-check by a wide margin: RoBERTa achieves fact-check $F_1=0.71$ but belief $F_1=0.52$, a gap of 0.19; DistilBERT shows a comparable gap (0.67 vs. 0.52). Belief detection is also markedly less stable across folds: fine-tuned belief shows standard deviations of 0.11–0.12, roughly two to four times the variability observed on fact-check (0.06–0.07) and *other* (0.03–0.05). This pattern suggests that belief detection is sensitive to which examples appear in training, consistent with a class whose surface cues are weak and instance-specific.

Why belief is harder. The gap between fact-check and belief is consistent across zero-shot and fine-tuned settings, across topics, and across architectures. We interpret this as a linguistic asymmetry between the two classes. Fact-check responses tend to carry explicit lexical and semantic cues, such as claims of evidence (“the source says …”), direct contradictions (“actually …”), or references to verifiable facts, that map cleanly onto label descriptions and provide strong training signal. Belief expressions are typically implicit, embedded in affective or sarcastic discourse, and require inference about the speaker’s stance rather than recognition of surface markers. This asymmetry is intrinsic to the task: scaling model size does not close it (as shown in [Section 5.2](https://arxiv.org/html/2606.04274#S5.SS2)), and fine-tuning narrows but does not eliminate it.

Implications. These results show that improvements from supervision are not limited to dominant or easily identified classes, but extend to the most challenging minority class. The gain on belief is what drives RoBERTa’s overall lead, and it is achieved at a fraction of the inference cost of the proprietary alternatives ([Table 8](https://arxiv.org/html/2606.04274#S5.T8)). Fine-tuning is not dead: where labelled data is available, even modestly sized supervised models outperform much larger commercial LLMs on the task that matters, namely detecting the propagation of misinformation, not just its correction. At the same time, the residual belief gap highlights a structural limitation of automated misinformation response classification: overall macro-$F_1$ gains may continue to come disproportionately from easier classes, while subtle expressions of belief remain difficult to capture reliably even under supervision.

## 6 Discussion and Conclusion

As LLMs become default tools for information verification, it is easy to assume that scale and generality suffice for difficult classification tasks. This paper tests that assumption. We compared seven zero-shot models (BART-MNLI, Llama-3.2-3B, Llama-3-8B, Llama-3-70B, Claude Haiku 4.5, Gemini Flash Lite 2.5, Claude Sonnet 4.6) against fine-tuned transformers (DistilBERT, RoBERTa) on 900 Reddit comments spanning three PolitiFact-verified misinformation claims. The assumption does not hold. Fine-tuned RoBERTa reaches 0.62 macro-$F_1$ against a best zero-shot result of 0.50 (Claude Haiku 4.5), using only a few hundred labelled examples per topic and at a fraction of the inference cost. The gap persists across topics, label schemas, and model sizes, and most of it comes from the *belief* class, the implicit, affective category on which every zero-shot model we tested under-performs.

LLMs as information verifiers: an asymmetric error problem. These findings bear on a use case that is becoming common. As LLMs are built into search engines, mobile assistants, and conversational interfaces, users increasingly delegate information verification to them: a user who encounters a claim online and asks an LLM whether it is true is requesting zero-shot stance classification of the surrounding discourse. Our results show a systematic failure mode in this setting. Every commercial model we tested over-predicts corrective responses and under-detects belief-propagating content. Claude Sonnet 4.6, the strongest of the three, drops belief $F_1$ to 0.17 under generic labels and refuses outright on a subset of comments it flags as sensitive. In a verification context, missing belief is the costlier error: failing to flag a comment that propagates misinformation does more harm than missing a fact-check. For this task, fine-tuned transformers such as RoBERTa are the more reliable choice. They outperform every zero-shot model we tested, including the proprietary LLMs, at a fraction of the per-query cost and without the safety-alignment refusals seen in Claude Sonnet 4.6.

A task-sensitive approach. The belief-detection gap is linguistic rather than capacity-bound. Fact-check responses carry explicit lexical markers, such as cited evidence and direct contradictions, that belief expressions embedded in affective or sarcastic discourse do not. Scaling alone will not close this gap, and safety alignment in commercial LLMs can widen it by making the model reluctant to assert that a comment “spreads misinformation” without strong contextual evidence. Label design, model architecture, and topic characteristics interact, and should be treated together rather than optimised in isolation. Where labelled data are available, fine-tuned transformers deliver higher accuracy at lower computational cost than large generative models; where they are not, NLI-based models such as BART-MNLI offer a practical, interpretable, and efficient baseline.

Limitations. Scope of claims. Each topic in this study corresponds to a single PolitiFact-verified claim: California water-ownership (environment), the bleach-cure (health), and the Kilmar García tattoo claim (immigration). The last involves a manipulated image, which makes its Reddit discourse qualitatively different from the text-based misinformation in the other two topics. Cross-topic differences may therefore reflect claim-specific discourse characteristics, such as vocabulary, register, and community norms, rather than domain-level properties. Our findings describe model behaviour on these three claims, and replication across multiple claims per domain is needed before broader domain-level conclusions can be drawn.

Paradigm comparison. We do not fine-tune Llama models or evaluate few-shot or in-context settings. Our objective is to compare three modelling paradigms, namely NLI-based zero-shot inference, prompt-based generative zero-shot models, and supervised classifiers, under typical deployment constraints. Fine-tuning large generative models would add computational overhead and infrastructure requirements that are often impractical in large-scale social media analysis; few-shot prompting is a natural middle ground and a direction for future work. We also classify each comment independently, without using Reddit’s threaded conversation structure, consistent with the LLM-as-verifier scenario in which a user submits an isolated claim for assessment.

Model coverage. The zero-shot comparison covers BART-MNLI, three Llama variants, and three commercial LLMs. Broader coverage, for example DeBERTa-MNLI, Mistral, and Gemma, would strengthen the generalisability of our conclusions about the relative importance of architecture versus scale.

Future work and data availability. Future work should replicate this analysis across multiple claims per domain to separate claim-level from domain-level effects, extend it to few-shot and instruction-tuned settings, incorporate Reddit’s threaded conversational structure, and develop annotation protocols that support multi-annotator reliability across diverse misinformation topics. The annotated dataset (Reddit comment IDs and labels) and classification code will be made publicly available upon acceptance.

## References

- Achiam et al\. \[2023\]Achiam J, Adler S, Agarwal S, et al \(2023\) Gpt\-4 technical report\. arXiv preprint arXiv:230308774
- Augenstein et al\. \[2016\]Augenstein I, Rocktäschel T, Vlachos A, et al \(2016\) Stance detection with bidirectional conditional encoding\. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\. Association for Computational Linguistics, pp 876–885,[10\.18653/v1/D16\-1084](https://arxiv.org/doi.org/10.18653/v1/D16-1084)
- Bacher \[2025\]Bacher D \(2025\) Fact check: The resnicks do not own “most of california’s water”\. URL[https://www\.c\-win\.org/blog/2025/1/27/fact\-check\-the\-resnicks\-do\-not\-own\-most\-of\-californias\-water](https://www.c-win.org/blog/2025/1/27/fact-check-the-resnicks-do-not-own-most-of-californias-water)
- Bladt \[2025\]Bladt C \(2025\) Claims about who owns california’s water are spreading online\. here’s what to know\. URL[https://www\.cbsnews\.com/news/los\-angeles\-wildfires\-stewart\-resnick\-lynda\-resnick\-water\-rights/](https://www.cbsnews.com/news/los-angeles-wildfires-stewart-resnick-lynda-resnick-water-rights/)
- Boe \[2023\]Boe B \(2023\) Praw 7\.7\.1 documentation\.[https://praw\.readthedocs\.io/en/stable/](https://praw.readthedocs.io/en/stable/), accessed: 2025\-03\-04
- Brown et al\. \[2020\]Brown TB, Mann B, Ryder N, et al \(2020\) Language models are few\-shot learners\. In: Advances in Neural Information Processing Systems 33 \(NeurIPS 2020\)\. Curran Associates, Inc\., pp 1877–1901
- Calefati \[2020\]Calefati J \(2020\) On covid\-19, donald trump said that “maybe if you drank bleach you may be okay\.”\. URL[https://www\.politifact\.com/factchecks/2020/jul/11/joe\-biden/no\-trump\-didnt\-tell\-americans\-infected\-coronavirus/](https://www.politifact.com/factchecks/2020/jul/11/joe-biden/no-trump-didnt-tell-americans-infected-coronavirus/)
- Cinelli et al\. \[2021\]Cinelli M, De Francisci Morales G, Galeazzi A, et al \(2021\) The echo chamber effect on social media\. Proceedings of the National Academy of Sciences 118\(9\):e2023301118\.[10\.1073/pnas\.2023301118](https://arxiv.org/doi.org/10.1073/pnas.2023301118)
- Derczynski et al\. \[2017\]Derczynski L, Bontcheva K, Liakata M, et al \(2017\) SemEval\-2017 task 8: RumourEval: Determining rumour veracity and support for rumours\. In: Proceedings of the 11th International Workshop on Semantic Evaluation \(SemEval\-2017\)\. Association for Computational Linguistics, pp 69–76,[10\.18653/v1/S17\-2006](https://arxiv.org/doi.org/10.18653/v1/S17-2006)
- Devlin et al\. \[2019\]Devlin J, Chang MW, Lee K, et al \(2019\) BERT: Pre\-training of deep bidirectional transformers for language understanding\. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\)\. Association for Computational Linguistics, pp 4171–4186,[10\.18653/v1/N19\-1423](https://arxiv.org/doi.org/10.18653/v1/N19-1423)
- Ferreira and Vlachos \[2016\]Ferreira W, Vlachos A \(2016\) Emergent: A novel data\-set for stance classification\. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\)\. Association for Computational Linguistics, pp 1163–1168,[10\.18653/v1/N16\-1138](https://arxiv.org/doi.org/10.18653/v1/N16-1138)
- Gorrell et al\. \[2019\]Gorrell G, Kochkina E, Liakata M, et al \(2019\) SemEval\-2019 task 7: RumourEval, determining rumour veracity and support for rumours\. In: Proceedings of the 13th International Workshop on Semantic Evaluation \(SemEval\-2019\)\. Association for Computational Linguistics, pp 845–854,[10\.18653/v1/S19\-2147](https://arxiv.org/doi.org/10.18653/v1/S19-2147)
- Guo et al\. \[2022\]Guo Z, Schlichtkrull M, Vlachos A \(2022\) A survey on automated fact\-checking\. Transactions of the Association for Computational Linguistics 10:178–206\.[10\.1162/tacl\_a\_00454](https://arxiv.org/doi.org/10.1162/tacl_a_00454)
- Hardalov et al\. \[2021\]Hardalov M, Arora A, Nakov P, et al \(2021\) A survey on stance detection for mis\- and disinformation identification\. In: Findings of the Association for Computational Linguistics: NAACL 2021\. Association for Computational Linguistics, pp 1259–1277,[10\.18653/v1/2021\.findings\-naacl\.324](https://arxiv.org/doi.org/10.18653/v1/2021.findings-naacl.324)
- Hazarika et al\. \[2018\]Hazarika D, Poria S, Gorantla S, et al \(2018\) CASCADE: Contextual sarcasm detection in online discussion forums\. In: Proceedings of the 27th International Conference on Computational Linguistics \(COLING\)\. Association for Computational Linguistics, pp 1837–1848
- Hedderich et al\. \[2021\]Hedderich MA, Lange L, Adel H, et al \(2021\) A survey on recent approaches for natural language processing in low\-resource scenarios\. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\)\. Association for Computational Linguistics, pp 2545–2568,[10\.18653/v1/2021\.naacl\-main\.201](https://arxiv.org/doi.org/10.18653/v1/2021.naacl-main.201)
- Jacobson \[2025\]Jacobson L \(2025\) Kilmar armando abrego garcia “had ‘ms\-13’ on his knuckles tattooed\. … he had ‘ms’ as clear as you can be\. not ’interpreted\.’”\. URL[https://www\.politifact\.com/factchecks/2025/apr/30/donald\-trump/trump\-abrego\-garcia\-hand\-tattoos\-abc\-news/](https://www.politifact.com/factchecks/2025/apr/30/donald-trump/trump-abrego-garcia-hand-tattoos-abc-news/)
- Joshi et al\. \[2017\]Joshi A, Bhattacharyya P, Carman MJ \(2017\) Automatic sarcasm detection: A survey\. ACM Computing Surveys 50\(5\):1–22\.[10\.1145/3124420](https://arxiv.org/doi.org/10.1145/3124420)
- Karande et al\. \[2021\]Karande H, Walambe R, Benjamin V, et al \(2021\) Stance detection with BERT embeddings for credibility analysis of information on social media\. PeerJ Computer Science 7:e467\.[10\.7717/peerj\-cs\.467](https://arxiv.org/doi.org/10.7717/peerj-cs.467)
- Kawintiranon and Singh \[2021\]Kawintiranon K, Singh L \(2021\) Knowledge enhanced masked language model for stance detection\. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\)\. Association for Computational Linguistics, pp 4725–4735,[10\.18653/v1/2021\.naacl\-main\.376](https://arxiv.org/doi.org/10.18653/v1/2021.naacl-main.376)
- Kochkina et al\. \[2017\]Kochkina E, Liakata M, Augenstein I \(2017\) Turing at SemEval\-2017 task 8: Sequential approach to rumour stance classification with Branch\-LSTM\. In: Proceedings of the 11th International Workshop on Semantic Evaluation \(SemEval\-2017\)\. Association for Computational Linguistics, pp 475–480,[10\.18653/v1/S17\-2083](https://arxiv.org/doi.org/10.18653/v1/S17-2083)
- Lewis et al\. \[2020\]Lewis M, Liu Y, Goyal N, et al \(2020\) BART: Denoising sequence\-to\-sequence pre\-training for natural language generation, translation, and comprehension\. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\)\. Association for Computational Linguistics, pp 7871–7880,[10\.18653/v1/2020\.acl\-main\.703](https://arxiv.org/doi.org/10.18653/v1/2020.acl-main.703)
- Liu et al\. \[2019\]Liu Y, Ott M, Goyal N, et al \(2019\) RoBERTa: A robustly optimized BERT pretraining approach\. arXiv preprint arXiv:190711692 URL[https://arxiv\.org/abs/1907\.11692](https://arxiv.org/abs/1907.11692)
- McCullough \[2025\]McCullough C \(2025\) No, one couple doesn’t own almost all california’s water\. URL[https://www\.politifact\.com/factchecks/2025/jan/14/more\-perfect\-union/does\-a\-billionaire\-couple\-own\-almost\-all\-the\-water/](https://www.politifact.com/factchecks/2025/jan/14/more-perfect-union/does-a-billionaire-couple-own-almost-all-the-water/)
- Mohammad et al\. \[2016\]Mohammad S, Kiritchenko S, Sobhani P, et al \(2016\) SemEval\-2016 task 6: Detecting stance in tweets\. In: Proceedings of the 10th International Workshop on Semantic Evaluation \(SemEval\-2016\)\. Association for Computational Linguistics, pp 31–41,[10\.18653/v1/S16\-1003](https://arxiv.org/doi.org/10.18653/v1/S16-1003)
- Mohammad et al\. \[2017\]Mohammad S, Sobhani P, Kiritchenko S \(2017\) Stance and sentiment in tweets\. ACM Transactions on Internet Technology 17\(3\):1–23\.[10\.1145/3003433](https://arxiv.org/doi.org/10.1145/3003433)
- Morrow \[2022\]Morrow S \(2022\) How this billionaire couple stole california’s water supply\. URL[https://perfectunion\.us/how\-this\-billionaire\-couple\-stole\-californias\-water\-supply/](https://perfectunion.us/how-this-billionaire-couple-stole-californias-water-supply/)
- PolitiFact \[2025\]PolitiFact \(2025\) Politifact\. URL[https://www\.politifact\.com/](https://www.politifact.com/)
- Rascouët\-Paz \[2025\]Rascouët\-Paz A \(2025\) No, billionaire couple does not “own most of california’s water”\. URL[https://www\.snopes\.com/news/2025/01/16/billonaire\-couple\-own\-californias\-water/](https://www.snopes.com/news/2025/01/16/billonaire-couple-own-californias-water/)
- Sanh et al\. \[2019\]Sanh V, Debut L, Chaumond J, et al \(2019\) DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter\. arXiv preprint arXiv:191001108 URL[https://arxiv\.org/abs/1910\.01108](https://arxiv.org/abs/1910.01108), NeurIPS 2019 Workshop on Energy Efficient Machine Learning and Cognitive Computing
- Schick and Schütze \[2021\]Schick T, Schütze H \(2021\) Exploiting cloze\-questions for few\-shot text classification and natural language inference\. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics \(EACL\)\. Association for Computational Linguistics, pp 255–269,[10\.18653/v1/2021\.eacl\-main\.20](https://arxiv.org/doi.org/10.18653/v1/2021.eacl-main.20)
- Shu et al\. \[2017\]Shu K, Sliva A, Wang S, et al \(2017\) Fake news detection on social media: A data mining perspective\. ACM SIGKDD Explorations Newsletter 19\(1\):22–36\.[10\.1145/3137597\.3137600](https://arxiv.org/doi.org/10.1145/3137597.3137600)
- Specht \[2024\]Specht P \(2024\) Says donald trump “told americans all they had to do was inject bleach in themselves\. just take a shot of uv light\.”\. URL[https://www\.politifact\.com/factchecks/2024/mar/28/joe\-biden/biden\-exaggerates\-trumps\-pandemic\-comments\-about\-d/](https://www.politifact.com/factchecks/2024/mar/28/joe-biden/biden-exaggerates-trumps-pandemic-comments-about-d/)
- Team et al\. \[2023\]Team G, Anil R, Borgeaud S, et al \(2023\) Gemini: a family of highly capable multimodal models\. arXiv preprint arXiv:231211805
- Thorne et al\. \[2018\]Thorne J, Vlachos A, Christodoulopoulos C, et al \(2018\) FEVER: A large\-scale dataset for fact extraction and VERification\. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\)\. Association for Computational Linguistics, pp 809–819,[10\.18653/v1/N18\-1074](https://arxiv.org/doi.org/10.18653/v1/N18-1074)
- Touvron et al\. \[2023\]Touvron H, et al \(2023\) Llama: Open and efficient foundation language models\. arXiv preprint arXiv:230213971
- Vosoughi et al\. \[2018\]Vosoughi S, Roy D, Aral S \(2018\) The spread of true and false news online\. Science 359\(6380\):1146–1151\.[10\.1126/science\.aap9559](https://arxiv.org/doi.org/10.1126/science.aap9559)
- Wei and Zou \[2019\]Wei J, Zou K \(2019\) EDA: Easy data augmentation techniques for boosting performance on text classification tasks\. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\)\. Association for Computational Linguistics, pp 6382–6388,[10\.18653/v1/D19\-1670](https://arxiv.org/doi.org/10.18653/v1/D19-1670)
- Weld et al\. \[2021\]Weld G, Glenski M, Althoff T \(2021\) Political bias and factualness in news sharing across more than 100,000 online communities\. In: Proceedings of the 15th International AAAI Conference on Web and Social Media \(ICWSM\), vol 15\. AAAI Press, pp 796–807,[10\.1609/icwsm\.v15i1\.18104](https://arxiv.org/doi.org/10.1609/icwsm.v15i1.18104)
- Williams et al\. \[2018\]Williams A, Nangia N, Bowman S \(2018\) A broad\-coverage challenge corpus for sentence understanding through inference\. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\)\. Association for Computational Linguistics, pp 1112–1122, URL[http://aclweb\.org/anthology/N18\-1101](http://aclweb.org/anthology/N18-1101)
- Yin et al\. \[2019\]Yin W, Hay J, Roth D \(2019\) Benchmarking zero\-shot text classification: Datasets, evaluation and entailment approach\. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\)\. Association for Computational Linguistics, pp 3914–3923,[10\.18653/v1/D19\-1404](https://arxiv.org/doi.org/10.18653/v1/D19-1404)
- Yu et al\. \[2020\]Yu J, Jiang J, Khoo LMS, et al \(2020\) Coupled hierarchical transformer for stance\-aware rumor verification in social media conversations\. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\. Association for Computational Linguistics, pp 1392–1401,[10\.18653/v1/2020\.emnlp\-main\.108](https://arxiv.org/doi.org/10.18653/v1/2020.emnlp-main.108)
- Zubiaga et al\. \[2016\]Zubiaga A, Liakata M, Procter R, et al \(2016\) Analysing how people orient to and spread rumours in social media by looking at conversational threads\. PLoS ONE 11\(3\):e0150989\.[10\.1371/journal\.pone\.0150989](https://arxiv.org/doi.org/10.1371/journal.pone.0150989)

## Appendix A: Misinformation Topic Background

This appendix provides detailed context for the three misinformation claims used in the study. The narratives below describe the origin of each claim, its fact-checking history, and the circumstances under which it circulated or re-emerged on Reddit. This contextual information, together with the thematic patterns in [Table 9](https://arxiv.org/html/2606.04274#A1.T9), informed the design of topic-specific annotation guidelines ([Section 3.2](https://arxiv.org/html/2606.04274#S3.SS2)) and zero-shot label sets ([Section 4](https://arxiv.org/html/2606.04274#S4)).

### Water Ownership in California \(Water Scandal\)

The first topic is the claim that “One billionaire couple owns almost all the water in California” \[[24](https://arxiv.org/html/2606.04274#bib.bib24)\], fact-checked by PolitiFact on 14 January 2025 and rated as False. Independent investigations by Snopes \[[29](https://arxiv.org/html/2606.04274#bib.bib29)\], the California Water Impact Network \[[3](https://arxiv.org/html/2606.04274#bib.bib3)\], and CBS News \[[4](https://arxiv.org/html/2606.04274#bib.bib4)\] all refuted the claim. Stewart and Lynda Resnick own a majority stake in a significant California water bank, but that represents only a small fraction of the state’s total water supply.

The original video propagating the claim was published in 2022 by a group called More Perfect Union \[[27](https://arxiv.org/html/2606.04274#bib.bib27)\]. It regained significant traction during the early 2025 California wildfires, when online users circulated it widely to assign blame for alleged water shortages affecting firefighting efforts. This resurgence illustrates how crisis events can revive dormant misinformation.

### COVID “Bleach Cure” Claim

The second topic concerns a recurring mischaracterisation of remarks made by former President Donald Trump during the COVID-19 pandemic. In 2020, Joe Biden asserted that “Donald Trump said that maybe if you drank bleach you may be okay” \[[7](https://arxiv.org/html/2606.04274#bib.bib7)\]; he reiterated a similar formulation in 2024, claiming Trump “told Americans all they had to do was inject bleach in themselves. Just take a shot of UV light” \[[33](https://arxiv.org/html/2606.04274#bib.bib33)\]. Both statements were rated Mostly False by PolitiFact.

During an April 2020 White House briefing, Trump speculated about whether disinfectants or ultraviolet light could be used inside the body as treatments; he did not explicitly instruct the public to ingest or inject bleach\. His remarks were widely criticised for their ambiguity and potential public health risk\. The claim resurfaced in 2025 following Trump’s controversial statements suggesting a link between Tylenol and autism, reigniting discussion around misinformation in public health discourse\.

### Misrepresented Gang Tattoo Imagery

The third topic concerns a claim propagated by Donald Trump in April 2025 regarding Kilmar Armando Ábrego García, a Salvadoran man. Trump posted an image on Truth Social showing García’s hand annotated with markings resembling the letters “MS13,” implying membership in the MS-13 gang. He later reinforced this in an ABC interview, asserting that García “had ‘MS-13’ on his knuckles tattooed … He had ‘MS’ as clear as you can be. Not interpreted” \[[17](https://arxiv.org/html/2606.04274#bib.bib17)\].

PolitiFact rated the claim Pants on Fire\. Analysis revealed that the “MS13” letters were not part of García’s actual tattoos but were added as digital overlays\. Experts in gang symbolism noted that the tattoos shown do not correspond to known MS\-13 iconography, and similar designs are commonly worn by individuals unaffiliated with gangs\. The false association was reportedly used to justify García’s deportation, illustrating the tangible legal and social consequences of visual misinformation\.

Table 9: Linguistic and rhetorical patterns that characterise each stance class across the three topics, as observed during annotation.

| Topic | Belief | Fact-Check | Other |
|---|---|---|---|
| Water Scandal | Expression of moral outrage and strong emotional reactions. Criticism of institutions (e.g., government, elites, corporations). Endorsement or gratitude for the information presented. Calls to action such as boycotts. | Presentation of evidence-based statistics or factual information. Scepticism about the plausibility of claims. Highlighting exaggeration or misleading framing. Identification of misleading video titles. | Discussion of tangential topics (e.g., water rights, wildfires). Commentary on the video or featured individuals. Political commentary unrelated to factual accuracy. Humour or cultural references. |
| COVID “Bleach Cure” | Propagation and insistence on the claim’s truth. Mockery or ridicule. Suggesting truth for political or comedic purposes. Appeals to outrage or consequences. | Debunking misinformation. Providing contextual clarification. Accurate citation of original statements. Criticism of media misinterpretation. Links to fact-checking sources. | Discussion of elections or voting processes. Partisan criticism unrelated to claim accuracy. Humour or satire. |
| Gang Tattoo Imagery | Affirmation of tattoo authenticity. Defense of Trump’s stance regarding the image. Claims of MS-13 membership based on imagery. | Identification of photoshopping or annotation. Highlighting deliberate misrepresentation. Reference to expert or authoritative judgments. | Discussion of tangential topics (e.g., due process, gang symbolism). Political commentary unrelated to verification. Opinions on Trump’s character or actions. |

## Appendix B: GPT-4o Pre-Filter Bias Analysis

This section reports the empirical analysis used to characterise the potential selection bias introduced by the GPT-4o pre-filtering step described in [Section 3.2](https://arxiv.org/html/2606.04274#S3.SS2).

Design. Two pools of 900 comments were compared: (1) the *labelled pool*, the 900 GPT-4o-pre-filtered, human-annotated comments used throughout the study; and (2) a *random pool*, 900 comments drawn uniformly at random from the remaining unlabelled comments in the same raw corpus (300 per topic, stratified), applying the same validity filter (minimum three words; excluding deleted or removed posts).
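The random-pool construction can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the record keys (`id`, `topic`, `body`), the function names, and the seed are hypothetical, and the deleted/removed check assumes Reddit's usual `[deleted]` and `[removed]` placeholder bodies.

```python
import random

# Placeholder bodies Reddit shows for deleted or moderator-removed comments.
REMOVED_MARKERS = {"[deleted]", "[removed]"}

def is_valid(body: str, min_words: int = 3) -> bool:
    """Validity filter: at least `min_words` words and not deleted/removed."""
    text = body.strip()
    if text in REMOVED_MARKERS:
        return False
    return len(text.split()) >= min_words

def sample_random_pool(comments, labelled_ids, per_topic=300, seed=0):
    """Draw up to `per_topic` valid, unlabelled comments uniformly per topic.

    `comments` is an iterable of dicts with hypothetical keys
    'id', 'topic', and 'body'.
    """
    rng = random.Random(seed)
    by_topic = {}
    for c in comments:
        if c["id"] in labelled_ids or not is_valid(c["body"]):
            continue
        by_topic.setdefault(c["topic"], []).append(c)
    pool = []
    for topic in sorted(by_topic):
        candidates = by_topic[topic]
        pool.extend(rng.sample(candidates, min(per_topic, len(candidates))))
    return pool
```

Stratifying by topic keeps the random pool's topic mix aligned with the labelled pool, so that any confidence difference is not simply a topic-composition effect.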

BART confidence comparison. BART-MNLI was run on the random pool using the same universal label set and prompting procedure as in the main evaluation. For each comment in both pools, the model’s max-confidence score (the probability assigned to the top predicted label) serves as a proxy for LLM “easiness”: comments that are easier for LLMs to classify receive higher confidence scores. [Table 10](https://arxiv.org/html/2606.04274#A2.T10) summarises the results. The labelled pool has marginally higher mean confidence (0.62 vs. 0.60), and the difference is statistically significant (Mann-Whitney $U$, $p = 0.004$; two-sided KS test, $D = 0.073$, $p = 0.016$). However, the effect size is negligible (Cohen’s $d = 0.13$, well below the conventional small-effect threshold of 0.2), indicating that the difference has no practical consequence for the validity of the LLM comparisons.
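For readers who want to reproduce the effect-size figure on their own confidence scores, the following is a minimal sketch of Cohen's $d$ with a pooled standard deviation. The pooled-variance form is an assumption, since the text does not state which variant was used, and the two input lists are toy values rather than the study's data.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (ddof = 1)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Toy max-confidence scores for two pools (illustrative only).
labelled = [0.71, 0.55, 0.62, 0.80, 0.48]
random_pool = [0.66, 0.52, 0.60, 0.75, 0.47]
print(f"d = {cohens_d(labelled, random_pool):.2f}")
```

A large sample makes even a small shift statistically detectable, which is why the significance tests and the effect size above can point in different directions.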

Table 10: BART-MNLI max-confidence comparison between the labelled pool (GPT-4o pre-filtered) and a random unlabelled sample from the same corpus.

| Pool | $n$ | Mean | Median | Cohen’s $d$ |
|---|---|---|---|---|
| Labelled (GPT-4o pre-filtered) | 900 | 0.620 | 0.598 | 0.13 (negligible) |
| Random (unlabelled) | 900 | 0.601 | 0.578 | |

Inter-model agreement. As a complementary measure, we computed the plurality agreement among all seven zero-shot models on the 900 labelled comments under the universal label schema. Only 25.2% of comments achieved agreement among $\geq 6$ of 7 models (mean agreement: 4.71/7), confirming that the labelled pool is not predominantly composed of trivially classifiable examples.
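The plurality-agreement measure can be sketched as below. The function names and the toy predictions are illustrative; the sketch assumes one predicted label per model per comment and takes the threshold of 6 from the text.

```python
from collections import Counter

def plurality_size(labels):
    """Number of models that voted for the most common label."""
    return Counter(labels).most_common(1)[0][1]

def agreement_summary(predictions, threshold=6):
    """Return (share of comments with >= threshold agreeing models,
    mean plurality size) for a list of per-comment label lists."""
    sizes = [plurality_size(row) for row in predictions]
    share_easy = sum(s >= threshold for s in sizes) / len(sizes)
    mean_size = sum(sizes) / len(sizes)
    return share_easy, mean_size

# Four toy comments, seven model predictions each (illustrative only).
toy = [
    ["belief"] * 7,
    ["fact-check"] * 6 + ["other"],
    ["belief"] * 3 + ["other"] * 3 + ["fact-check"],
    ["other"] * 4 + ["belief"] * 3,
]
share, mean_size = agreement_summary(toy)
print(share, mean_size)  # 0.5 5.0
```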

[Fig. 1](https://arxiv.org/html/2606.04274#A2.F1) shows the BART confidence distributions for both pools (panel a) and the inter-model agreement histogram for the labelled pool (panel b).

![Refer to caption](https://arxiv.org/html/2606.04274v1/x1.png)

Figure 1: (a) BART-MNLI max-confidence distributions for the labelled pool (blue) and the random unlabelled pool (orange). Dashed vertical lines mark the respective means. The difference is statistically significant (Mann-Whitney $p = 0.004$) but negligible in effect size (Cohen’s $d = 0.13$). (b) Inter-model plurality agreement across all seven zero-shot models for the 900 labelled comments. Green bars indicate comments where $\geq 6$/7 models agree (“easy”); only 25.2% of comments meet this threshold.

## Appendix C: Statistical Significance of Pairwise Comparisons

This section reports two complementary significance analyses for the headline pairwise comparisons discussed in the main text. The first uses a permutation test on all 900 individual predictions, providing greater statistical power. The second uses a paired Wilcoxon signed-rank test on the five per-fold macro-$F_1$ values, matching the cross-validation design used throughout.
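One way to see why the fold-level test has less power: with only five paired values, the exact two-sided Wilcoxon signed-rank $p$-value cannot fall below $2/2^5 = 0.0625$, even when one model wins every fold. The sketch below enumerates all sign assignments to show this; the function name and the fold differences are hypothetical, and the code assumes no zero differences and no ties in absolute value.

```python
from itertools import product

def exact_wilcoxon_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value by full enumeration.

    Assumes no zero differences and no tied absolute differences.
    """
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    w_plus = sum(rank[i] for i in range(n) if diffs[i] > 0)
    center = n * (n + 1) / 4  # expected W+ under the null
    observed = abs(w_plus - center)
    hits = 0
    # Under the null, each rank gets a + or - sign with probability 1/2.
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if abs(w - center) >= observed:
            hits += 1
    return hits / 2 ** n

# Hypothetical per-fold macro-F1 differences where model A wins every fold.
fold_diffs = [0.10, 0.12, 0.08, 0.15, 0.11]
print(exact_wilcoxon_p(fold_diffs))  # 0.0625
```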

### C.1 Permutation Test on All 900 Predictions

[Table 11](https://arxiv.org/html/2606.04274#A3.T11) reports a two-sided permutation test conducted directly on all 900 per-item predictions. This nonparametric approach requires no distributional assumptions and provides substantially greater statistical power when item-level predictions are available.

Procedure. Let $\hat{y}^{A} = (\hat{y}^{A}_{1}, \ldots, \hat{y}^{A}_{n})$ and $\hat{y}^{B} = (\hat{y}^{B}_{1}, \ldots, \hat{y}^{B}_{n})$ denote the predicted three-class labels (belief / fact-check / other) of models A and B over the $n$ relevant items ($n = 900$ for overall comparisons; $n \approx 300$ for per-topic comparisons), and let $y = (y_{1}, \ldots, y_{n})$ denote the corresponding gold labels. For fine-tuned models, the full 900-item prediction vectors are reconstructed by concatenating per-fold test-set predictions, with each item appearing in exactly one fold’s test set. The observed statistic is $\Delta_{\mathrm{obs}} = F_{1}(\hat{y}^{A}, y) - F_{1}(\hat{y}^{B}, y)$, where $F_{1}$ denotes macro-$F_{1}$ over the three classes.

Under the null hypothesis that models A and B are exchangeable, the assignment of predictions to models carries no information. Each permutation independently draws a binary coin for every item $i$: with probability $0.5$, item $i$’s predictions are swapped, replacing $(\hat{y}^{A}_{i}, \hat{y}^{B}_{i})$ with $(\hat{y}^{B}_{i}, \hat{y}^{A}_{i})$; with probability $0.5$ they are left unchanged. The gold labels $y_{i}$ are never permuted. After each of the $N = 10{,}000$ independent permutations, the permuted difference $\Delta_{\mathrm{perm}}$ is recomputed from the resulting prediction vectors. The two-sided $p$-value is the fraction of permutations for which $|\Delta_{\mathrm{perm}}| \geq |\Delta_{\mathrm{obs}}|$, floored at $1/N = 0.0001$ to avoid an exact zero when no permutation meets the threshold. Holm–Bonferroni correction is applied simultaneously across all 17 comparisons.
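The procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and seed are hypothetical, a class with no true or predicted instances is scored as $F_1 = 0$ (a convention the text does not specify), and a vectorised version would be preferable for 900 items and 10,000 permutations.

```python
import random

LABELS = ("belief", "fact-check", "other")

def macro_f1(pred, gold, labels=LABELS):
    """Unweighted mean of per-class F1; empty classes score 0 (assumption)."""
    scores = []
    for c in labels:
        tp = sum(p == c and g == c for p, g in zip(pred, gold))
        fp = sum(p == c and g != c for p, g in zip(pred, gold))
        fn = sum(p != c and g == c for p, g in zip(pred, gold))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def permutation_test(pred_a, pred_b, gold, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on the macro-F1 difference."""
    rng = random.Random(seed)
    observed = macro_f1(pred_a, gold) - macro_f1(pred_b, gold)
    hits = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:  # fair coin: swap this item's predictions
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        delta = macro_f1(perm_a, gold) - macro_f1(perm_b, gold)
        if abs(delta) >= abs(observed):
            hits += 1
    # Floor at 1/N so the reported p-value is never exactly zero.
    return observed, max(hits / n_perm, 1 / n_perm)

# Toy example: model A is perfect, model B always predicts "other".
gold = list(LABELS) * 20
delta, p = permutation_test(gold, ["other"] * len(gold), gold, n_perm=500)
print(f"delta = {delta:.3f}, p = {p}")
```

Swapping predictions item by item, rather than shuffling labels, preserves the pairing between the two models on each comment, which is what makes the test sensitive to small but consistent differences.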

Table 11: Statistical significance of headline pairwise comparisons (two-sided permutation test on macro-$F_1$, $N{=}10{,}000$ permutations; Holm–Bonferroni correction across all 17 comparisons). $\Delta$ macro-$F_1$ = observed difference (positive favours the first-named model). n.d.s. = not demonstrably significant ($p_{\text{Holm}}\geq 0.05$).

| Comparison | $\Delta$ macro-$F_1$ | $p$ | $p_{\text{Holm}}$ | Result |
| --- | --- | --- | --- | --- |
| (a) Llama model scale ([Section 5.2](https://arxiv.org/html/2606.04274#S5.SS2)) | | | | |
| Llama-3-8B vs Llama-3-70B (overall) | $-0.044$ | $0.014$ | $0.128$ | n.d.s. |
| Llama-3-8B vs Llama-3-70B (environment) | $-0.024$ | $0.445$ | $1.000$ | n.d.s. |
| Llama-3-8B vs Llama-3-70B (health) | $-0.038$ | $0.254$ | $1.000$ | n.d.s. |
| Llama-3-8B vs Llama-3-70B (immigration) | $-0.071$ | $0.013$ | $0.128$ | n.d.s. |
| (b) Haiku vs. Gemini ([Section 5.1](https://arxiv.org/html/2606.04274#S5.SS1)) | | | | |
| Claude Haiku 4.5 vs Gemini Flash Lite 2.5 (overall) | $+0.023$ | $0.160$ | $0.840$ | n.d.s. |
| (c) Sonnet vs. Haiku ([Section 5.2](https://arxiv.org/html/2606.04274#S5.SS2)) | | | | |
| Claude Sonnet 4.6 vs Claude Haiku 4.5 (overall) | $-0.082$ | $<0.001$ | $0.002$ | $p<0.01$ |
| (d) RoBERTa vs. DistilBERT ([Section 5.4](https://arxiv.org/html/2606.04274#S5.SS4)) | | | | |
| RoBERTa vs DistilBERT (overall) | $+0.030$ | $0.107$ | $0.747$ | n.d.s. |
| (e) Fine-tuned vs. best zero-shot ([Section 5.4](https://arxiv.org/html/2606.04274#S5.SS4)) | | | | |
| RoBERTa vs Claude Haiku 4.5 (overall) | $+0.123$ | $<0.001$ | $0.002$ | $p<0.01$ |
| RoBERTa vs Claude Haiku 4.5 (environment) | $+0.157$ | $<0.001$ | $0.002$ | $p<0.01$ |
| RoBERTa vs Claude Haiku 4.5 (health) | $+0.149$ | $<0.001$ | $0.002$ | $p<0.01$ |
| RoBERTa vs Claude Haiku 4.5 (immigration) | $+0.070$ | $0.078$ | $0.622$ | n.d.s. |
| (f) Topic-specific vs. universal label schema ([Section 5.1](https://arxiv.org/html/2606.04274#S5.SS1)) | | | | |
| BART-MNLI E1 vs UL (environment) | $+0.225$ | $<0.001$ | $0.002$ | $p<0.01$ |
| BART-MNLI H1 vs UL (health) | $+0.032$ | $0.424$ | $1.000$ | n.d.s. |
| BART-MNLI I1 vs UL (immigration) | $+0.002$ | $0.945$ | $1.000$ | n.d.s. |
| Claude Sonnet 4.6 E1 vs UL (environment) | $+0.272$ | $<0.001$ | $0.002$ | $p<0.01$ |
| Claude Sonnet 4.6 H1 vs UL (health) | $+0.059$ | $0.140$ | $0.840$ | n.d.s. |
| Claude Sonnet 4.6 I1 vs UL (immigration) | $+0.126$ | $<0.001$ | $0.008$ | $p<0.01$ |

### C.2 Wilcoxon Signed-Rank Test Across Folds

[Table 12](https://arxiv.org/html/2606.04274#A3.T12) reports paired two-sided Wilcoxon signed-rank tests for the headline pairwise comparisons. Per-fold macro-$F_1$ values are derived from the same stratified 5-fold cross-validation splits used throughout. For zero-shot models, predictions are fixed and the five per-fold $F_1$ values reflect variation in test-set composition only. For fine-tuned models, per-fold $F_1$ values are read directly from the prediction files generated during training.

Power caveat. With $n{=}5$ folds, the paired two-sided Wilcoxon signed-rank test has a minimum achievable $p$-value of $0.0625$, which strictly exceeds the conventional $\alpha{=}0.05$ threshold. Consequently, no pairwise comparison can achieve formal significance under $\alpha{=}0.05$ with this test configuration, regardless of effect size. This is a structural limitation of evaluating on five folds, not an indication that the observed differences are absent. All comparisons are therefore reported as directionally consistent (or inconsistent) across folds, with the $W$ statistic and raw $p$-value provided for transparency. Holm–Bonferroni correction is applied across all comparisons; corrected $p$-values are shown as $p_{\text{Holm}}$.
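The $0.0625$ floor can be checked by enumerating the exact null distribution. The sketch below is an illustration rather than the authors' code: the function name `wilcoxon_exact_p_table` is introduced here, and it assumes all $n$ paired differences are nonzero and untied, so the ranks are exactly $1,\ldots,n$ and each of the $2^n$ sign assignments is equally likely under the null.

```python
from itertools import product


def wilcoxon_exact_p_table(n):
    """Exact two-sided p-value for each attainable W = min(W+, W-).

    Assumes n nonzero, untied paired differences (ranks 1..n).
    """
    ranks = range(1, n + 1)
    total = n * (n + 1) // 2
    counts = {}
    # Under the null, every pattern of signs on the ranks is equally likely.
    for signs in product((0, 1), repeat=n):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        w = min(w_plus, total - w_plus)
        counts[w] = counts.get(w, 0) + 1
    n_assignments = 2 ** n
    table = {}
    cumulative = 0
    # Two-sided p-value: probability of a statistic at least as extreme,
    # i.e. min(W+, W-) less than or equal to the observed W.
    for w in sorted(counts):
        cumulative += counts[w]
        table[w] = cumulative / n_assignments
    return table
```

For `n = 5`, the smallest entry is $2/2^{5} = 0.0625$ at $W = 0$, reached only when all five folds favour the same model. The same table accounts for the $W$ and $p$ pairs in Table 12: $W = 1$ gives $0.125$, $W = 2$ gives $0.1875$, $W = 3$ gives $0.3125$, and $W = 7$ gives $1.0$. With six folds the floor would fall to $2/2^{6} = 0.03125$, below $\alpha = 0.05$.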

Table 12: Statistical significance of headline pairwise comparisons across 5-fold cross-validation splits (paired two-sided Wilcoxon signed-rank test; Holm–Bonferroni correction applied across all 17 comparisons). $\Delta$ macro-$F_1$ = mean per-fold difference (positive favours the first-named model). Rankings are directionally consistent across all folds.

| Comparison | $\Delta$ macro-$F_1$ | $W$ | $p$ | $p_{\text{Holm}}$ |
| --- | --- | --- | --- | --- |
| (a) Llama model scale ([Section 5.2](https://arxiv.org/html/2606.04274#S5.SS2)) | | | | |
| Llama-3-8B vs Llama-3-70B (overall) | $-0.048$ | $0$ | $0.063$ | $1.000$ |
| Llama-3-8B vs Llama-3-70B (environment) | $-0.011$ | $7$ | $1.000$ | $1.000$ |
| Llama-3-8B vs Llama-3-70B (health) | $-0.061$ | $3$ | $0.313$ | $1.000$ |
| Llama-3-8B vs Llama-3-70B (immigration) | $-0.075$ | $1$ | $0.125$ | $1.000$ |
| (b) Haiku vs. Gemini ([Section 5.1](https://arxiv.org/html/2606.04274#S5.SS1)) | | | | |
| Claude Haiku 4.5 vs Gemini Flash Lite 2.5 (overall) | $+0.025$ | $1$ | $0.125$ | $1.000$ |
| (c) Sonnet vs. Haiku ([Section 5.2](https://arxiv.org/html/2606.04274#S5.SS2)) | | | | |
| Claude Sonnet 4.6 vs Claude Haiku 4.5 (overall) | $-0.081$ | $0$ | $0.063$ | $1.000$ |
| (d) RoBERTa vs. DistilBERT ([Section 5.4](https://arxiv.org/html/2606.04274#S5.SS4)) | | | | |
| RoBERTa vs DistilBERT (overall) | $+0.034$ | $2$ | $0.188$ | $1.000$ |
| (e) Fine-tuned vs. best zero-shot ([Section 5.4](https://arxiv.org/html/2606.04274#S5.SS4)) | | | | |
| RoBERTa vs Claude Haiku 4.5 (overall) | $+0.123$ | $0$ | $0.063$ | $1.000$ |
| RoBERTa vs Claude Haiku 4.5 (environment) | $+0.164$ | $0$ | $0.063$ | $1.000$ |
| RoBERTa vs Claude Haiku 4.5 (health) | $+0.148$ | $0$ | $0.063$ | $1.000$ |
| RoBERTa vs Claude Haiku 4.5 (immigration) | $+0.064$ | $0$ | $0.063$ | $1.000$ |
| (f) Topic-specific vs. universal label schema ([Section 5.1](https://arxiv.org/html/2606.04274#S5.SS1)) | | | | |
| BART-MNLI E1 vs UL (environment) | $+0.243$ | $0$ | $0.063$ | $1.000$ |
| BART-MNLI H1 vs UL (health) | $+0.048$ | $3$ | $0.312$ | $1.000$ |
| BART-MNLI I1 vs UL (immigration) | $-0.006$ | $7$ | $1.000$ | $1.000$ |
| Claude Sonnet 4.6 E1 vs UL (environment) | $+0.283$ | $0$ | $0.063$ | $1.000$ |
| Claude Sonnet 4.6 H1 vs UL (health) | $+0.041$ | $2$ | $0.188$ | $1.000$ |
| Claude Sonnet 4.6 I1 vs UL (immigration) | $+0.131$ | $0$ | $0.063$ | $1.000$ |
