On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

arXiv cs.CL 06/02/26, 04:00 AM Papers
llm zero-shot annotation priors adaptability prompt-steerability toxicity-detection
Summary
This paper investigates how LLMs' internal priors affect zero-shot annotation performance, finding that nearly two-thirds of errors resist prompt-based correction and introducing Definition-Specific Familiarity as a better predictor than memorization metrics.
arXiv:2606.00467v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
Original Article
View Cached Full Text
Cached at: 06/02/26, 03:37 PM
# On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance
Source: [https://arxiv.org/html/2606.00467](https://arxiv.org/html/2606.00467)
###### Abstract

Large Language Models \(LLMs\) are increasingly used for zero\-shot annotation and LLM\-as\-a\-judge tasks, yet their reliability hinges on how model\-internalized priors interact with user\-provided instructions\. We investigate three dimensions of this interaction: \(1\) how an LLM’s familiarity with data and task definitions affects performance, \(2\) the extent to which additional information in prompts can correct zero\-shot errors \(“decision stickiness”\), and \(3\) model susceptibility to misaligned task definitions\. Through experiments on toxicity detection across diverse datasets \(spanning social media, gaming, news, and forums\) using both dense and mixture\-of\-experts models, we find that nearly two\-thirds of zero\-shot errors are resistant to correction, with an overall rescue rate \(fraction of initial errors corrected by prompting\) of only 34\.8%\. High\-confidence errors prove especially resistant to correction\. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition\. Crucially, we introduce Definition\-Specific Familiarity \(DSF\), which measures alignment between a model’s internal concept and the task definition\. After controlling for dataset\-level confounds, DSF shows a positive association with model performance \(partialr=\+0\.41r=\+0\.41\), while three distinct memorization metrics \(ROUGE\-L, BERTScore, and embedding cosine similarity\) all fail to show a positive association\. These findings show the limitations of prompt\-based correction in annotation tasks, highlighting the importance of definition alignment over text\-level memorization\.

Large Language Models, Model\-Internalized Priors, Zero\-Shot Learning, Prompt Steerability, Annotation Reliability, Model Controllability

![Refer to caption](https://arxiv.org/html/2606.00467v1/x1.png)Figure 1:Study overview and research questions\.*Left:*zero\-shot annotation setup: given an input text and a user\-provided task definition, the LLM produces a label prediction\.*Right:*we study three facets of the interaction between model\-internalized task concepts and user instructions\.RQ1tests whether task familiarity correlates with performance, contrasting*definition alignment*\(Definition\-Specific Familiarity; DSF\) with*text memorization*metrics \(ROUGE\-L, BERTScore, embedding similarity; Table[4](https://arxiv.org/html/2606.00467#S4.T4)\)\.RQ2measures steerability under an aligned definition by asking whether zero\-shot errors can be*rescued*\(rescue rate\), and how this depends on confidence\.RQ3probes susceptibility to*misaligned*definitions, testing how performance and confidence change when the provided definition is incorrect\. Insets summarize the main empirical findings for each question\.## 1Introduction

Large Language Models \(LLMs\) are increasingly deployed for zero\-shot annotation\(Gilardi et al\.,[2023](https://arxiv.org/html/2606.00467#bib.bib12)\)and rubric\-based evaluation \(LLM\-as\-a\-judge\) tasks\(Zheng et al\.,[2023](https://arxiv.org/html/2606.00467#bib.bib52); Liu et al\.,[2023](https://arxiv.org/html/2606.00467#bib.bib24)\)\. These use cases are attractive: users provide a task definition via prompt, and the model annotates at scale without labeled training data\(Gilardi et al\.,[2023](https://arxiv.org/html/2606.00467#bib.bib12); Brown et al\.,[2020](https://arxiv.org/html/2606.00467#bib.bib3)\)\. The implicit assumption is that the user’s definition governs the model’s behavior\.

But LLMs are not blank slates\. Through pre\-training on web\-scale corpora\(Brown et al\.,[2020](https://arxiv.org/html/2606.00467#bib.bib3)\), followed by instruction tuning and reinforcement learning from human feedback \(RLHF\), models develop implicit understandings of common concepts and tasks\. When a user asks an LLM to detect “toxic” content, the model has already been shaped by millions of documents and human\-rated examples that discuss, define, and exemplify toxicity, and this exposure shapes an internal concept of the task\. Critically, this internalized concept may not align with the user’s intended definition: toxicity is operationalized differently across applications, as identity attacks\(Borkan et al\.,[2019](https://arxiv.org/html/2606.00467#bib.bib2)\), disruptive behavior in online gaming\(Kocielnik et al\.,[2024](https://arxiv.org/html/2606.00467#bib.bib20),[2025b](https://arxiv.org/html/2606.00467#bib.bib22)\), or as speech targeting protected groups\(ElSherief et al\.,[2018](https://arxiv.org/html/2606.00467#bib.bib9)\)\. An LLM anchored to a particular understanding may not simultaneously match all of these definitions, and its internalized priors may constrain how much user instructions can steer its behavior and alignment with the user’s intended interpretation\(Min et al\.,[2022](https://arxiv.org/html/2606.00467#bib.bib27)\)\. This disconnect between a model’s internalized concepts and user\-specified definitions represents a fundamental alignment challenge\.

When the user’s task definition conflicts with the model’s internalized concept, which one governs behavior? Prior work has shown that LLMs exhibit biases reflecting the demographics and viewpoints in their training data — both at the pre\-training stage and after instruction tuning or RLHF\(Nadeem et al\.,[2021](https://arxiv.org/html/2606.00467#bib.bib30); Nangia et al\.,[2020](https://arxiv.org/html/2606.00467#bib.bib31); Parrish et al\.,[2022](https://arxiv.org/html/2606.00467#bib.bib35); Santurkar et al\.,[2023](https://arxiv.org/html/2606.00467#bib.bib39)\)\. Prompting and reasoning techniques can inadvertently amplify such latent biases\(Shaikh et al\.,[2023](https://arxiv.org/html/2606.00467#bib.bib41)\)\. Less is known about the degree to which users can override a model’s internalized task understanding through prompting, or how to measure the alignment between a model’s internal concept and a task definition\.

We investigate this tension through three research questions \(Figure[1](https://arxiv.org/html/2606.00467#S0.F1)\):

- •RQ1 \(Familiarity and Performance\):How does an LLM’s familiarity with the data and task definition affect annotation performance?
- •RQ2 \(Decision Stickiness\):To what extent can external knowledge or specialized prompting correct initial zero\-shot errors, and how does model confidence relate to correctability?
- •RQ3 \(Misalignment Susceptibility\):How do LLMs behave when given misaligned or incorrect task definitions, and can confidence scores detect such misalignment?

To answer these questions, we evaluate 9 LLMs across 5 toxicity detection datasets spanning social media, gaming, news, and forum domains\. We focus on toxicity due to the high potential for interpretive tension between LLMs and human task\-specific definitions\. We introduce*Definition Specific Familiarity \(DSF\)*, a metric that quantifies alignment between a model’s internal concept of a task and the dataset’s specific definition\. Our key findings are: \(1\) nearly two\-thirds of zero\-shot errors resist correction through any prompting strategy, including automated prompt refinement techniques, with an overall rescue rate of only 34\.8%; \(2\) high\-confidence errors are especially resistant to correction, exhibiting strong “decision stickiness”; \(3\) after controlling for dataset\-level confounds, DSF \(computed as the consensus across six sentence encoders, on an extendedN=54N=54panel\) is positively associated with which models perform better at annotation \(partialr=\+0\.41r=\+0\.41\), while text memorization shows no positive association \(partialr=−0\.19r=\-0\.19\); and \(4\) LLMs faithfully follow misaligned definitions yet remain highly confident even when applying incorrect instructions, revealing a critical calibration failure that prevents confidence\-based detection of definition errors\. These results suggest that for annotation tasks, conceptual alignment with task definitions, not data familiarity, determines model performance, and task concepts that the model has internalized through training and tuning anchor its behavior in ways that limit the effectiveness of prompt\-based steering\. Importantly, we treat prompting as the*control channel*and measure fundamental limits of that channel set by these internalized priors\. This complements recent work on steerability in generation tasks\(Chang et al\.,[2026](https://arxiv.org/html/2606.00467#bib.bib5)\)\. We find that our metric, DSF, is positively associated with which models perform better on a given task, and show that confidence scores provide no reliable signal for detecting when models are applying incorrect definitions\. Code and analysis pipelines are publicly available at[https://github\.com/etmaca5/llm\-internalized\-priors\-for\-annotation](https://github.com/etmaca5/llm-internalized-priors-for-annotation)\.

## 2Background and Related Work

#### LLM Steerability and Controllability\.

Recent work has formalized steerability as the degree to which a model’s behavior can be shifted from its baseline via interventions such as prompting\(Miehling et al\.,[2025](https://arxiv.org/html/2606.00467#bib.bib26)\)\.Chang et al\. \([2026](https://arxiv.org/html/2606.00467#bib.bib5)\)introduce a multi\-dimensional goal\-space framework for measuring steerability in text generation, identifying three key failure modes: poor coverage \(rare goals underrepresented\), miscalibration \(overshooting requests\), and side effects \(unintended changes to unspecified dimensions\)\. Their evaluation reveals that current LLMs struggle with reliable steerability even under various interventions including prompt engineering and reinforcement learning\. Our work addresses a complementary question: in*annotation*tasks where the output is a discrete label rather than generated text, what determines whether a model can be steered toward a specific task definition? We find analogous failure modes: limited reachability \(stubborn errors\), miscalibration \(flat confidence under misalignment\), and side effects \(rescue\-corruption tradeoffs\)\.

#### LLMs as Annotators and Judges\.

The use of LLMs for evaluating other models or annotating datasets has become a standard paradigm\.Gilardi et al\. \([2023](https://arxiv.org/html/2606.00467#bib.bib12)\)showed that ChatGPT can outperform crowd\-workers on text\-annotation tasks, whileZheng et al\. \([2023](https://arxiv.org/html/2606.00467#bib.bib52)\)established LLM\-as\-a\-judge evaluation with MT\-Bench\.Kim et al\. \([2024](https://arxiv.org/html/2606.00467#bib.bib19)\)introduced Prometheus, a model specifically fine\-tuned for fine\-grained rubric\-based evaluation, highlighting the importance of specialized alignment for assessment tasks\. However,Baumann et al\. \([2025](https://arxiv.org/html/2606.00467#bib.bib1)\)replicated 37 annotation tasks from 21 published studies using 18 different language models, finding that configuration choices can lead to incorrect conclusions in 31\-50% of cases\. Relatedly,Kocielnik et al\. \([2025a](https://arxiv.org/html/2606.00467#bib.bib21)\)show that aligning a model’s concept with a curated human definition is a prerequisite for reliable large\-scale annotation of player chat, motivating explicit definition\-alignment diagnostics\. With just a few prompt paraphrases across multiple LLMs, researchers can present most hypotheses as statistically significant \- a phenomenon coined “LLM hacking\.” This suggests that LLMs should be treated as complex instruments requiring rigorous validation rather than convenient black\-box annotators\.

#### Model\-Internalized Priors and Annotation Reliability\.

Modern LLMs develop strong internal priors across the full training pipeline \(pre\-training, instruction tuning, and RLHF/DPO\) that can conflict with user\-provided definitions\.Carlini et al\. \([2021](https://arxiv.org/html/2606.00467#bib.bib4)\)demonstrate that LLMs memorize specific training examples, suggesting models have ingrained conceptions of common concepts like toxicity\. Post\-training alignment can amplify these effects:Zhang et al\. \([2026](https://arxiv.org/html/2606.00467#bib.bib50)\)identify*typicality bias*where annotators systematically favor familiar text during RLHF, causing models to anchor to “typical” examples rather than following nuanced definitions\.Min et al\. \([2022](https://arxiv.org/html/2606.00467#bib.bib27)\)show that in\-context learning is often driven by format and label space rather than input\-label mappings, suggesting models may not fully override their priors even with explicit examples\. Prior work on benchmark contamination has developed metrics like Min\-K% Prob\(Shi et al\.,[2024](https://arxiv.org/html/2606.00467#bib.bib42)\), BERTScore\(Zhang et al\.,[2020](https://arxiv.org/html/2606.00467#bib.bib51)\), and continuation\-based detection\(Golchin & Surdeanu,[2024](https://arxiv.org/html/2606.00467#bib.bib13)\), but less is known about how internalized*conceptual*priors \(vs\. text memorization\) constrain steerability\.

#### Confidence and Calibration\.

Model accuracy alone can mask calibration failures critical for annotation reliability\. Surveys of LLM confidence estimation\(Geng et al\.,[2024](https://arxiv.org/html/2606.00467#bib.bib11)\)highlight challenges with overconfidence and unreliable uncertainty quantification\. Black\-box confidence elicitation methods reveal that current LLMs lack an internally coherent sense of confidence\(Han et al\.,[2025](https://arxiv.org/html/2606.00467#bib.bib15)\), often overstating self\-reported scores while showing wide variability when prompted to reconsider\(Xiong et al\.,[2024](https://arxiv.org/html/2606.00467#bib.bib47); Pawitan & Holmes,[2025](https://arxiv.org/html/2606.00467#bib.bib36)\)\. While verbalized confidence can be better calibrated than token probabilities under RLHF\(Tian et al\.,[2023](https://arxiv.org/html/2606.00467#bib.bib44)\), prompt design strongly shapes reliability\(Yang et al\.,[2024](https://arxiv.org/html/2606.00467#bib.bib48)\)\. We adopt verbalized confidence to enable applicability to closed\-source models that do not expose token\-level probability scores\.

## 3Methods

We propose a framework to evaluate the interplay between the priors an LLM has internalized through training and tuning \(familiarity\) and its steerability via prompting \(alignment\) in the context of annotation tasks\.

### 3\.1Familiarity Metrics\.

We distinguish between two types of familiarity that could influence annotation performance:*text familiarity*\(whether the model has memorized specific texts from the dataset\) and*definition familiarity*\(whether the model’s internal concept aligns with the task definition\)\. These represent different mechanisms: text familiarity suggests performance via generation of memorized content, while definition familiarity suggests performance via conceptual task alignment\.

#### Text Familiarity\.

Measures memorization by prompting the model to generate a continuation given a prefix comprising 40% of the text instance, then computingROUGE\-L F1\(based on longest common subsequence\)\(Lin,[2004](https://arxiv.org/html/2606.00467#bib.bib23)\)between the generated continuation and the remaining ground truth text\. This adapts the continuation\-based contamination detection approach ofGolchin & Surdeanu \([2024](https://arxiv.org/html/2606.00467#bib.bib13)\)for continuous familiarity measurement: rather than using dataset\-specific prompts for binary contamination detection, we use generic continuation prompts \(see Appendix[B](https://arxiv.org/html/2606.00467#A2)\) at temperatures 0\.0 and 0\.7, taking the maximum score\. To ensure findings are robust to the choice of memorization proxy, we additionally computeBERTScore F1\(Zhang et al\.,[2020](https://arxiv.org/html/2606.00467#bib.bib51)\)using DeBERTa\-XL contextual token embeddings, andembedding cosine similarityusing the same sentence encoders as DSF—so any difference between metrics reflects signal type \(lexical vs\. contextual vs\. semantic reproduction\) rather than encoder choice\. High scores indicate that the model can reproduce text that it likely encountered during training\.

#### Definition\-Specific Familiarity \(DSF\)\.

Quantifies alignment between a model’s internal understanding of the target phenomenon \(e\.g\., what constitutes toxicity\) and the dataset’s operational definition of that phenomenon\. The intuition is analogous to human annotators: two annotators given the same definition may interpret it differently based on their prior conceptions and identities\(Sap et al\.,[2022](https://arxiv.org/html/2606.00467#bib.bib40); Plank,[2022](https://arxiv.org/html/2606.00467#bib.bib37)\)\. An annotator whose mental model aligns with the definition will perform better than one with a mismatched conception, regardless of whether they have seen the specific texts before\.

To compute DSF, we prompt the model to explain its understanding of the target concept \(e\.g\., “In your own words, what makes content toxic?”\) and measure semantic similarity between this explanation and the dataset’s full definition using sentence embeddings\. To remove sensitivity to any single embedding model, we compute DSF as the unweighted mean \(*consensus DSF*\) across six diverse sentence encoders: MiniLM, MPNet, BGE\-large, E5\-large, Instructor\-large, and OpenAI’stext\-embedding\-3\-small\. Appendix[E\.1](https://arxiv.org/html/2606.00467#A5.SS1)shows that all six encoders produce positive partial correlations with accuracy and that pairwise DSF values are highly concordant \(r≥0\.88r\\geq 0\.88\)\. DSF–accuracy correlations are computed across all nine models and six datasets \(adding the Jigsaw Unintended Bias dataset\(Borkan et al\.,[2019](https://arxiv.org/html/2606.00467#bib.bib2)\)to the five primary datasets\), yieldingN=54N=54model–dataset pairs\. Crucially, DSF can be measured using only the task definition and a handful of prompts, without requiring labeled data or full annotation runs, making it a practical diagnostic for selecting model–task pairings\. This quantification captures the entire concept boundary including both positive and negative criteria\. We also tested alternative variants incorporating domain\-specific prompts and separate positive/negative class definitions, but these showed weaker correlation with model performance on a given task \(see Appendix[E](https://arxiv.org/html/2606.00467#A5)\)\.

### 3\.2Steerability Metrics

To quantify an LLM’s ability to correct its errors when provided with better instructions \(steerability\), we define theRescue Rate:

Rescue Rate=P\(Correct∣Prompted,Zero\-Shot Wrong\)\\text\{Rescue Rate\}=P\(\\text\{Correct\}\\mid\\text\{Prompted\},\\text\{Zero\-Shot Wrong\}\)\(1\)This metric captures the reachability of correct behaviors via prompting\. A low rescue rate indicates that many behaviors are unreachable through prompt\-based steering alone\. We definedecision stickinessas the tendency for high\-confidence errors to resist correction\.

To investigate how LLMs respond to incorrect instructions \(RQ3\), we employ*concept substitution*: providing a definition from a different dataset that operationalizes a related but semantically distinct concept\. For example, when annotating a general toxicity dataset, we substitute definitions of hate speech \(which requires identity\-based targeting\) for gaming toxicity \(which includes any disruptive behavior\)\. We test 6 misaligned conditions varying in*definition scope*: narrow definitions like hate speech require identity\-based targeting, while broad definitions like gaming toxicity capture any disruptive behavior\. Full definitions are in Appendix[A](https://arxiv.org/html/2606.00467#A1)\.

We measure four complementary metrics that decompose how prompting changes behavior relative to zero\-shot:Change Rate\(fraction of flipped labels\) captures how responsive a model is to the provided definition\.Rescue Rate\(P\(correct∣zero\-shot wrong\)P\(\\text\{correct\}\\mid\\text\{zero\-shot wrong\}\)\) quantifies how often prompting fixes errors\.Corruption Rate\(P\(wrong∣zero\-shot correct\)P\(\\text\{wrong\}\\mid\\text\{zero\-shot correct\}\)\) measures how often prompting breaks correct predictions\. Models that are easily steered can be steered in both beneficial and harmful directions\. Finally,Prediction Bias\(predicted positive rate \- true positive rate\) captures whether prompting shifts predictions toward over\-calling \(more false positives\) or under\-calling \(more false negatives\) relative to the dataset’s base rate\.

#### Confidence\.

We elicit verbalized confidence scores by prompting the model to output a numerical score \(0–100\) alongside its prediction\. The exact prompt suffix is shown in Figure[2](https://arxiv.org/html/2606.00467#S3.F2)\. We use the same template across all models; only the label string varies by dataset\. All calls use temperature 0\.

\[Base classification prompt\.\.\.\]Respond in exactly this format: PREDICTION: \[LABEL/not LABEL\] CONFIDENCE: \[0\-100\]

Figure 2:Confidence elicitation prompt suffix appended to all classification prompts\.LABELis replaced with the dataset\-specific positive class \(e\.g\.,toxic,hateful,offensive, etc\.\)\. The base prompt varies by condition \(zero\-shot, definition, few\-shot, misaligned, etc\.\)\. This template is identical across all models\.

### 3\.3Experimental Setup

#### Models\.

We evaluate nine instruction\-tuned models spanning open\-weights and proprietary systems \(Table[2](https://arxiv.org/html/2606.00467#S3.T2)\), organized in two tiers\. Six models form the*core*of our analysis, spanning dense architectures \(7B–70B parameters\) and mixture\-of\-experts architectures \(Mixtral\-8x7B, DeepSeek\-V3\) across the Llama and Mistral families\. All mixed\-effects regressions \(RQ2, RQ3\) are scoped to these six models to keep statistical inference consistent\. We additionally evaluate three*extended*models—GPT\-4o\-mini, Llama\-3\.3\-70B, and Qwen\-2\.5\-72B—under every prompting condition on every primary dataset\. These verify that our headline accuracy, rescue\-rate, and DSF findings extend to a broader, more recent slice of proprietary and open\-weights systems\. For MoE models, we report both total and active parameters per forward pass\. Appendix[G](https://arxiv.org/html/2606.00467#A7)details which models contribute to each table and figure\.

#### Datasets\.

We focus on toxicity and hate speech detection across five primary datasets covering a range of online communication environments, from formal news comments to informal gaming chats, each with a different operational definition of toxicity \(Table[1](https://arxiv.org/html/2606.00467#S3.T1)\)\. These five form the analytic core and appear in every results table in the main paper\. We additionally use three supplementary datasets for targeted robustness and generalization checks:*Jigsaw Unintended Bias*, which extends the DSF panel toN=54N=54model–dataset pairs and serves as a label\-bias robustness check for the rescue\-rate analysis; and*SemEval\-2018 Irony*together with*Subjectivity*which provide non\-toxicity replications of RQ1 and RQ2 \(Appendix[F](https://arxiv.org/html/2606.00467#A6)\)\. Appendix[G](https://arxiv.org/html/2606.00467#A7)maps each dataset to the specific analyses it contributes to\.

Table 1:Overview of datasets used in our experiments\. % Pos\. indicates the proportion of positive \(toxic/offensive\) labels in the original dataset\.†\{\\dagger\}only used in supplementary analysis\.Table 2:Overview of models evaluated\. Act\. = parameters active per forward pass \(MoE models\)\. N/D = not disclosed\.∗Estimated from unofficial sources; not officially disclosed by OpenAI\.
#### Experiments\.

We evaluate five prompting conditions: \(1\) Zero\-Shot: task name only with no additional context, \(2\) Aligned Definition: task name plus the dataset’s official definition of the target concept, \(3\) Few\-Shot: task name plus 4 random balanced examples, \(4\) FS\+Def: the aligned definition combined with 4 few\-shot examples, and \(5\) Misaligned: definitions from other datasets applied to the target dataset \(6 variants, detailed in Appendix[A](https://arxiv.org/html/2606.00467#A1)\)\. All prompts follow a consistent template where the classification question is followed by the message and an “Answer:” suffix\. We additionally run two DSPy\(Khattab et al\.,[2023](https://arxiv.org/html/2606.00467#bib.bib18)\)automated optimization conditions on all nine models: DSPy Optimized \(automatically selected few\-shot examples\) and DSPy Aligned \(selected examples plus aligned definition\)\. For each dataset\-model pair, we randomly sampled 1,000 instances\. All inference was conducted via API at temperature 0 for reproducibility\.

#### Statistical Analysis

For RQ1, we examined associations between familiarity metrics and annotation accuracy using regression with cluster\-robust standard errors \(to account for dependence within models\) and partial correlations controlling for dataset\. For RQ2 and RQ3, we used mixed effects logistic regression on the six open\-weights models to account for repeated observations of the same texts across conditions\. This handles the binary outcome \(correct vs\. incorrect\) while modeling the non\-independence of the same text being evaluated under multiple prompting conditions\. We include random intercepts for text and fixed effects for model, prompting condition, domain, and their interactions\.

## 4Results

Our core analysis covers predictions across 9 models, 5 toxicity datasets, and 10 prompting conditions, plus two DSPy automated optimization conditions\. Table[3](https://arxiv.org/html/2606.00467#S4.T3)summarizes model performance and steerability, while Table[5](https://arxiv.org/html/2606.00467#S4.T5)presents key findings from our mixed effects regression models\.

Table 3:Model performance \(Accuracy % / AUC\) across prompting conditions\. Misaligned averages 6 definition\-swap conditions \(full breakdown in Appendix[D](https://arxiv.org/html/2606.00467#A4)\)\.†DSPy Opt\. degrades Llama\-3\.1\-8B \(76\.7→\\to72\.3\) and Llama\-3\.3\-70B \(80\.5→\\to77\.5\), consistent withTan et al\. \([2025](https://arxiv.org/html/2606.00467#bib.bib43)\)\.

### 4\.1Answering RQ1: Impact of Model Familiarity

We analyzed whether annotation performance correlates more strongly with text memorization or semantic task definition alignment \(Definition\-Specific Familiarity, DSF\)\. Table[4](https://arxiv.org/html/2606.00467#S4.T4)reports the statistical association between three memorization metrics \(ROUGE\-L, BERTScore, and embedding cosine similarity\) and DSF against task performance\. Unadjusted correlations suggest a strong negative relationship between memorization and accuracy \(r=−0\.80r=\-0\.80for ROUGE\-L,−0\.76\-0\.76for BERTScore,−0\.71\-0\.71for embedding similarity\)\. However, these are not directly interpretable because memorization scores are correlated with dataset difficulty \(e\.g\., Jigsaw is both high\-memorization and hard\), making them sensitive to between\-dataset differences\.

After controlling for dataset difficulty using partial correlations, all three memorization metrics show no positive association \(r=−0\.19,−0\.15,−0\.16r=\-0\.19,\-0\.15,\-0\.16respectively\)\. In contrast, consensus DSF is significantly positively associated with performance \(r=\+0\.41r=\+0\.41,p=0\.003p=0\.003,N=54N=54model–dataset pairs\), an effect that remains robust even after controlling for text complexity \(negative log\-likelihood\)\. This confirms that DSF captures semantic alignment rather than simple processing fluency, indicating that*models whose internal concepts align with the task definition perform significantly better*\. The same direction is observed across each of the six embedding models used to compute DSF \(per\-embedding partialrrranges from\+0\.30\+0\.30to\+0\.49\+0\.49; see Appendix[E\.1](https://arxiv.org/html/2606.00467#A5.SS1)\)\. The partial correlation is further amplified in few\-shot settings \(partialr=\+0\.62r=\+0\.62,N=54N=54\); when an aligned definition is provided it attenuates to partialr=\+0\.25r=\+0\.25\(N=54N=54,p=0\.085p=0\.085\), consistent with explicit definitions compressing between\-model variance in DSF\.

Crucially,*the relation also holds on two non\-safety domains*\(irony and subjectivity detection\), where DSF remains positively associated with zero\-shot accuracy \(partialr=\+0\.34r=\+0\.34,N=18N=18model–dataset pairs\); see Appendix[F](https://arxiv.org/html/2606.00467#A6)\.

Table 4:Correlations between familiarity metrics and zero\-shot annotation accuracy on the extendedN=54N=54panel\. Raw correlations are misleading due to dataset confounds; partial correlations \(controlling for dataset\) reveal that definition alignment \(consensus DSF\) is positively associated with performance while three distinct memorization metrics are not\.
### 4\.2Answering RQ2: Impact of Knowledge Injection

Table[5](https://arxiv.org/html/2606.00467#S4.T5)presents results from mixed effects logistic regression models examining factors associated with annotation correctness and error correction\. Odds Ratios \(OR\) indicate the multiplicative change in odds:OR=2\.0OR=2\.0means the odds double when a factor is present, whileOR=0\.5OR=0\.5means the odds are halved\.

Domain is most strongly associated with correctness: Gaming is approximately 14 times easier than Forum \(OR=13\.85OR=13\.85\), while Social Media is 5 times easier \(OR=5\.45OR=5\.45\)\. Notably, aligned definitions provide only minimal improvement over the zero\-shot baseline \(\+2\.2% accuracy on average, Table[3](https://arxiv.org/html/2606.00467#S4.T3)\), while misaligned definitions perform worse than aligned \(\-1\.8% andOR=1\.08OR=1\.08for aligned vs\. misaligned\)\. DSPy automated optimization produces similarly modest gains on average: 80\.2% \(DSPy Opt\.\) and 81\.5% \(DSPy Aligned\) are within 0\.5pp of the zero\-shot \(80\.3%\) and aligned definition \(82\.0%\) baselines respectively, confirming that the performance ceiling reflects model\-internalized priors rather than suboptimal prompt design\. Gains are heterogeneous across models: GPT\-4o\-mini improves slightly \(84\.4%\), while both Llama models degrade under DSPy Opt\. — Llama\-3\.1\-8B from 76\.7% to 72\.3% and Llama\-3\.3\-70B from 80\.5% to 77\.5% — consistent withTan et al\. \([2025](https://arxiv.org/html/2606.00467#bib.bib43)\), who find that DSPy optimizers can degrade performance relative to the unoptimized baseline across Llama\-family models, with 10th\-percentile relative gains as low as−13\.9%\-13\.9\\%\.

#### Zero\-shot Correctness\.

Zero\-shot correctness is highly associated with prompted correctness \(OR=6\.43OR=6\.43\)\. Conditional on the other covariates in the model, a correct zero\-shot response corresponds to6\.4×6\.4\\timeshigher odds of being correct when provided aligned definitions or examples\. This suggests that prompting is more effective at consolidating correct answers than at rescuing errors\.

Table 5:Key findings from mixed effects logistic regression\. OR\>\>1 indicates higher odds of a correct annotation\. Reference categories: Forum \(domain\), Llama\-8B \(model\)\. Full model specifications in Appendix[C](https://arxiv.org/html/2606.00467#A3)\. Allp<0\.001p<0\.001\.FactorOR \[95% CI\]Factors Associated with Correctness:Gaming \(vs\. Forum\)13\.8513\.85\[13\.31, 14\.41\]Social Media \(vs\. Forum\)5\.455\.45\[5\.27, 5\.63\]Aligned Definition1\.081\.08\[1\.04, 1\.11\]Aligned×\\timesNews1\.911\.91\[1\.80, 2\.02\]Zero\-shot Correct6\.436\.43\[6\.28, 6\.58\]Factors Associated with Error Correction:ZS Confidence \(per SD\)0\.840\.84\[0\.82, 0\.87\]Table 6:Rescue rate by model, aggregated across prompted conditions\. Rescue rate isP\(correct∣prompted,zero\-shot wrong\)P\(\\text\{correct\}\\mid\\text\{prompted\},\\text\{zero\-shot wrong\}\)\.*Rescue Rate*is the probability that prompting corrects a zero\-shot error\. The overall rescue rate is only 34\.8%, meaning nearly two\-thirds of errors resist correction\. Decision stickiness similarly replicates on the non\-safety domains, with an overall few\-shot rescue rate of 45\.0% acrossN=3,052N=3\{,\}052zero\-shot errors \(Appendix[F](https://arxiv.org/html/2606.00467#A6)\)\. Figure[3](https://arxiv.org/html/2606.00467#S4.F3)shows an inverted\-U: rescue probability peaks at 51\.8% for moderate confidence \(0\.6–0\.7\) and falls sharply to 20\.8% for high\-confidence errors \(\>0\.9\>0\.9\), the most statistically reliable region of the curve\. Model 4 corroborates this trend: each standard\-deviation increase in zero\-shot confidence reduces rescue odds by 16% \(OR=0\.84OR=0\.84, Table[5](https://arxiv.org/html/2606.00467#S4.T5)\)\.*When models are confidently wrong, prompting rarely helps\.*

The drop at the extreme low\-confidence end \(n<n<1,000\) is a distinct failure mode rather than a counterexample to decision stickiness\. These instances are out\-of\-distribution and context\-deficient: their ROUGE\-L is far below the rest of the data \(0\.025 vs\. 0\.061\) and they are dramatically shorter \(median 6 vs\. 22 words\), consisting largely of single\-word gaming fragments \(e\.g\., “noob”, “wtf”, “bot”\) whose toxicity depends on context that is unavailable to the model\.

Importantly, steerability is not a monotonic function of model size: for example, Mistral\-Small\-24B rescues more zero\-shot errors than Llama\-3\.1\-70B in our experiments, despite being substantially smaller \(Table[6](https://arxiv.org/html/2606.00467#S4.T6)\)\. To confirm this finding is not an artifact of label\-bias issues known to affect the standard Jigsaw dataset \(e\.g\., systematic flagging of African American Vernacular English as toxic\), we replicated the rescue rate analysis on the Jigsaw Unintended Bias dataset\(Borkan et al\.,[2019](https://arxiv.org/html/2606.00467#bib.bib2)\), which was explicitly designed to reduce identity\-group annotation bias\. Rescue rates remain uniformly below 50% across all nine models \(Table[19](https://arxiv.org/html/2606.00467#A4.T19)\), confirming that decision stickiness is not specific to Jigsaw’s label properties\.

![Refer to caption](https://arxiv.org/html/2606.00467v1/rescue_vs_confidence.png)Figure 3:Rescue probability vs\. zero\-shot confidence for zero\-shot errors\. The inverted\-U shape exhibits two distinct failure modes \(analyzed in[*Confidence and Decision Stickiness*](https://arxiv.org/html/2606.00467#S4.T6)\):*decision stickiness*at high confidence \(right tail\) and an out\-of\-distribution effect at very low confidence \(left tail\)\.

### 4\.3Answering RQ3: Misalignment Impact

We analyze 6 misaligned definition conditions to investigate how LLMs respond to incorrect instructions \(Table[7](https://arxiv.org/html/2606.00467#S4.T7)\)\. The results reveal several surprising patterns\.

We find that LLMs are responsive to definition scope\. Narrow definitions \(requiring targeting based on race, religion, gender, etc\.\) produce prediction bias of−7%\-7\\%to−12%\-12\\%\(under\-prediction\), while broad definitions produce bias of\+9%\+9\\%to\+13%\+13\\%\(over\-prediction\)\. This shift in prediction rates demonstrates that LLMs do not simply ignore misaligned instructions; they adjust decision threshold to match the provided criteria\. This responsiveness is a double\-edged sword: while it enables definition\-based steering, it also means models will confidently apply incorrect definitions\.

The worst misaligned condition \(gaming toxicity: 76\.4%\) is 5\.6% below aligned, while the best \(Fox News hate speech: 82\.6%\) actually*outperforms*the aligned definition by 0\.6%\. Knowledge built up during training and tuning constrains how much definitions can shift behavior: even with “wrong” definitions, models remain relatively stable, indicating persistent influence from these internalized task concepts\.

Surprisingly, some misaligned definitions provide net benefit over zero\-shot \(rescuing more errors than they corrupt\)\. For instance, using a “hate speech” definition can improve performance on general toxicity tasks by focusing the model on clearer criteria, even if technically misaligned\. This finding challenges the assumption that the “correct” definition is always optimal \- empirically, some misaligned definitions better match the model’s internal concept\.

Table 7:Misalignment analysis: LLMs adjust prediction thresholds based on definition, yet maintain high confidence\. Rows are ordered by definition breadth \(Narrow→\\toBroad\)\. Confidence remains 84–91% regardless of alignment\. All values are percentages\.Table 8:Mean self\-reported confidence \(%\) by prompting condition\. Zero\-shot confidence is essentially unchanged relative to aligned and misaligned definitions\.Critical Calibration Failure\.Models*remain confident even when applying misaligned definitions*\(Table[7](https://arxiv.org/html/2606.00467#S4.T7)\), with confidence levels remaining essentially unchanged from the zero\-shot baseline \(Table[8](https://arxiv.org/html/2606.00467#S4.T8)\)\. The narrowest definition \(Fox News hate speech\) produces the*highest*confidence \(91\.2%\), not the lowest\. This represents a fundamental calibration failure: models should exhibit increased uncertainty when instructions contradict their training distribution, but they do not\. As shown in Figure[4](https://arxiv.org/html/2606.00467#S4.F4), all conditions \(zero\-shot, aligned, and misaligned\) exhibit similar overconfidence patterns, with calibration curves falling below the diagonal\. Critically, there is no meaningful separation between aligned and misaligned definitions, meaning*practitioners cannot rely on confidence scores to detect definition errors*\.

![Refer to caption](https://arxiv.org/html/2606.00467v1/calibration_curve.png)Figure 4:Calibration curves showing confidence vs\. actual accuracy across conditions\. All conditions exhibit overconfidence \(below the diagonal\), with no meaningful separation between aligned and misaligned definitions: models cannot distinguish when they are applying incorrect instructions\.Across all models and datasets, definition choice produces 17% accuracy variation \(e\.g\., for Fox News, accuracy ranges from 61\.6% with the “Gaming Toxicity” definition to 78\.4% with the “Twitter Hate Speech” definition\), while model choice produces only 5% variation on average\. This implies that for annotation tasks,*careful definition design is more impactful than selecting larger or more capable models*\.

Models showing high rescue rates under misalignment also show high corruption rates \(Table[6](https://arxiv.org/html/2606.00467#S4.T6)and Table[18](https://arxiv.org/html/2606.00467#A4.T18)\)\. Mistral\-Small\-24B achieves the highest rescue rate \(73\.7% with Twitter hate speech\) but also the highest corruption rate \(21\.0%\)\. This creates a fundamental trade\-off:*models that are easily steered by definitions can be steered in both beneficial and harmful directions*\.

## 5Discussion

Our results suggest that LLM\-based annotation is often constrained less by prompt design than by the fit between a model’s internalized concept and the task’s definition\. Across models and datasets, Definition\-Specific Familiarity \(DSF\) is positively associated with accuracy after accounting for between\-dataset differences, whereas three distinct text memorization metrics \(ROUGE\-L, BERTScore, embedding similarity\) all fail to\. This pattern supports a concept\-level view of annotation: performance depends mainly on whether the model’s implicit decision rule matches the dataset’s definition \(i\.e\., where it draws the toxic/non\-toxic line\), rather than on text\-level memorization\.

#### Definition alignment matters more than contamination for annotation\.

A common concern is that benchmark performance reflects memorization of test examples\. In our setting, cross\-dataset association between memorization \(whether measured via ROUGE\-L, BERTScore, or embedding similarity\) and accuracy, once we control for difficulty and domain differences, provides little explanatory power at the model level, while DSF shows a consistent positive correlation\. Practically, this shifts the emphasis from auditing overlap with benchmark text toward evaluating whether the model’s*concept boundary*matches the intended definition\.

#### Prompting has limited ability to overturn wrong decisions\.

We find substantial*decision stickiness*: most zero\-shot errors persist under aligned definitions and few\-shot examples, and high\-confidence errors are especially resistant to correction \(Figure[3](https://arxiv.org/html/2606.00467#S4.F3), Table[5](https://arxiv.org/html/2606.00467#S4.T5)\)\. This indicates that prompting primarily stabilizes or modestly improves predictions the model already gets right, rather than serving as a reliable mechanism for error recovery\. We further test whether iterative, history\-aware prompting can break this*stickiness*: a three\-turn rescue sequence that progressively adds few\-shot examples, the aligned definition, and an explicit reconsideration prompt raises rescue from 7\.5% at Turn 1 to only 18\.7% at Turn 3, and lifts high\-confidence rescue to just 8\.5% \(Appendix[D\.1](https://arxiv.org/html/2606.00467#A4.SS1)\)\. Decision stickiness therefore persists under iterative correction, not only one\-shot prompting\. Notably, steerability is not monotonic in model size \(Table[6](https://arxiv.org/html/2606.00467#S4.T6)\), implying that larger models are not necessarily easier to steer for annotation workloads\.

#### Misaligned definitions shift behavior without reliably signaling risk\.

When given semantically related but mismatched definitions, models change their prediction rates in systematic ways \(Table[7](https://arxiv.org/html/2606.00467#S4.T7), see also Table[16](https://arxiv.org/html/2606.00467#A4.T16)in Appendix\), consistent with literal application of the provided criteria\. However, accuracy often degrades only modestly and sometimes even improves relative to the aligned condition\. One interpretation is that the model’s training supplies a strong prior that anchors behavior, providing a performance floor even under imperfect instructions\. Another possibility is that dataset label boundaries may be closer to some alternative formulations than the dataset’s stated definition, especially for inherently subjective constructs\. Regardless, the key operational point is that*definition wording is a first\-order experimental factor*: in our study, definition choice can induce substantial performance variation, so treating the definition as fixed can meaningfully understate uncertainty in downstream analyses\.

#### Confidence is conditional, not diagnostic of definition validity\.

A central deployment risk is that models remain highly confident under misalignment, and calibration curves show little separation between aligned and misaligned conditions \(Figure[4](https://arxiv.org/html/2606.00467#S4.F4)\)\. Importantly, this does*not*imply that the model “should know” the definition is wrong; rather, it suggests that the reported confidence reflects certainty*given the provided instructions*, not a meta\-uncertainty about whether those instructions match the intended labeling policy\. As a result, confidence thresholds are poorly suited for detecting definition errors or definition\-dataset mismatch\.

#### Practical implications for LLM annotation pipelines\.

These findings suggest three lightweight safeguards\. First, measure*definition alignment*before large\-scale annotation \(e\.g\., DSF\-style checks\) to anticipate which model–definition pairs are likely to work well\. Second,*stress\-test definitions*by evaluating multiple plausible formulations and reporting sensitivity \(e\.g\., changes in bias, rescue, and corruption\), rather than assuming a single “correct” wording is robust\. Third, avoid using confidence as a proxy for definition appropriateness\.

#### Limitations and future work\.

Our conclusions are based on toxicity\-related binary classification with concept\-substitution misalignment; other task types \(multi\-class, span labeling, or open\-ended judging\) may exhibit different failure modes\. We report DSF as a consensus across six diverse sentence encoders, mitigating embedding\-choice sensitivity \(Appendix[E\.1](https://arxiv.org/html/2606.00467#A5.SS1)\)\. Robustness to alternative similarity measures \(e\.g\., Mahalanobis distance, learned metrics\) and elicitation prompts remains an important next step\. Importantly, all evaluated models are instruction\-tuned, and we cannot distinguish whether low rescue rates reflect capability limitations or intentional design choices: alignment training may deliberately limit steerability for safety reasons, making some “stickiness” a feature rather than a bug\. Our design also cannot isolate which training stage \(pre\-training, instruction tuning, or RLHF/DPO\) produces the observed concept anchoring; we use*model\-internalized priors*as a stage\-agnostic term, and disentangling the sources would require comparing base and post\-trained variants of the same model\. Finally, our findings are correlational: establishing that DSF causally drives accuracy \(or confidence causally drives stickiness\) would require interventions such as targeted fine\-tuning that manipulates DSF while holding other factors constant\. Future work could test stronger forms of misalignment, alternative multi\-turn correction strategies beyond the protocol in Appendix[D\.1](https://arxiv.org/html/2606.00467#A4.SS1)\(e\.g\., self\-consistency, debate, or critique\-and\-revise\), and scalable procedures for selecting model–definition pairs under explicit constraints\.

## 6Conclusion

We study how model\-internalized priors interact with prompt\-based steering in LLM annotation\. Across 9 models and 5 toxicity datasets, we find that performance is best explained by*definition\-specific familiarity*\(DSF\): controlling for dataset difficulty, DSF is positively associated with accuracy \(partialr=\+0\.41r=\+0\.41,N=54N=54\), while three distinct memorization metrics \(ROUGE\-L, BERTScore, embedding similarity\) are not\. Prompting often fails to correct mistakes, as most zero\-shot errors persist even under aligned definitions and few\-shot examples\. Models can confidently follow*misaligned*definitions, making confidence an unreliable indicator of definition correctness\. Finally, definition wording induces larger performance swings than model choice, highlighting definition design as crucial\. Practically, LLM annotation pipelines should prioritize measuring definition alignment as an early indicator of performance rather than relying on larger models or confidence\-based strategies\.

## Acknowledgment

This work is supported by the Caltech Linde Center for Science, Society, and Public Policy \(LCSSP\)\. R\. Michael Alvarez is Flintridge Foundation Professor of Political and Computational Social Science at Caltech\. We thank Andrea Boonyarungsrit, Grant Cahill, Jonathan Lane, Min Kim, and Fereshteh Soltani for prior collaboration that helped inform some of the motivating challenges behind this work\. The views and opinions expressed in this paper are solely those of the authors and do not necessarily reflect those of the individuals acknowledged or their affiliated organizations\.

## Impact Statement

This work examines how model\-internalized priors interact with prompt\-based steering in LLM annotation tasks\. Our findings have implications for practitioners deploying LLMs as annotators or judges\. Positively, identifying failure modes such as decision stickiness and miscalibration under definition misalignment can help practitioners develop better validation procedures and avoid over\-reliance on confidence scores\. However, our results also reveal that models will confidently apply incorrect definitions when provided, which could lead to systematic errors if task specifications are poorly designed\.

## References

- Baumann et al\. \(2025\)Baumann, J\., Röttger, P\., Urman, A\., Wendsjö, A\., Plaza\-del Arco, F\. M\., Gruber, J\. B\., and Hovy, D\.Large language model hacking: Quantifying the hidden risks of using LLMs for text annotation\.*arXiv preprint arXiv:2509\.08825*, 2025\.doi:10\.48550/arXiv\.2509\.08825\.URL[https://arxiv\.org/abs/2509\.08825](https://arxiv.org/abs/2509.08825)\.
- Borkan et al\. \(2019\)Borkan, D\., Dixon, L\., Sorensen, J\., Thain, N\., and Vasserman, L\.Nuanced metrics for measuring unintended bias with real data for text classification\.In*Companion Proceedings of The 2019 World Wide Web Conference*, WWW ’19, pp\. 491–500, New York, NY, USA, 2019\. Association for Computing Machinery\.ISBN 9781450366755\.doi:10\.1145/3308560\.3317593\.URL[https://doi\.org/10\.1145/3308560\.3317593](https://doi.org/10.1145/3308560.3317593)\.
- Brown et al\. \(2020\)Brown, T\., Mann, B\., Ryder, N\., Subbiah, M\., Kaplan, J\. D\., Dhariwal, P\., Neelakantan, A\., Shyam, P\., Sastry, G\., Askell, A\., Agarwal, S\., Herbert\-Voss, A\., Krueger, G\., Henighan, T\., Child, R\., Ramesh, A\., Ziegler, D\., Wu, J\., Winter, C\., Hesse, C\., Chen, M\., Sigler, E\., Litwin, M\., Gray, S\., Chess, B\., Clark, J\., Berner, C\., McCandlish, S\., Radford, A\., Sutskever, I\., and Amodei, D\.Language models are few\-shot learners\.In Larochelle, H\., Ranzato, M\., Hadsell, R\., Balcan, M\., and Lin, H\. \(eds\.\),*Advances in Neural Information Processing Systems*, volume 33, pp\. 1877–1901\. Curran Associates, Inc\., 2020\.URL[https://proceedings\.neurips\.cc/paper\_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a\-Paper\.pdf](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)\.
- Carlini et al\. \(2021\)Carlini, N\., Tramèr, F\., Wallace, E\., Jagielski, M\., Herbert\-Voss, A\., Lee, K\., Roberts, A\., Brown, T\., Song, D\., Erlingsson, Ú\., Oprea, A\., and Raffel, C\.Extracting training data from large language models\.In*30th USENIX Security Symposium \(USENIX Security 21\)*, pp\. 2633–2650\. USENIX Association, August 2021\.ISBN 978\-1\-939133\-24\-3\.URL[https://www\.usenix\.org/conference/usenixsecurity21/presentation/carlini\-extracting](https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting)\.
- Chang et al\. \(2026\)Chang, T\., Schnabel, T\., Swaminathan, A\., and Wiens, J\.A course correction in steerability evaluation: Revealing miscalibration and side effects in LLMs\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, 2026\.URL[https://arxiv\.org/abs/2505\.23816](https://arxiv.org/abs/2505.23816)\.
- cjadams et al\. \(2018\)cjadams, Sorensen, J\., Elliott, J\., Dixon, L\., McDonald, M\., nithum, and Cukierski, W\.Toxic comment classification challenge\.Kaggle Competition, 2018\.URL[https://www\.kaggle\.com/c/jigsaw\-toxic\-comment\-classification\-challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)\.
- Davidson et al\. \(2017\)Davidson, T\., Warmsley, D\., Macy, M\., and Weber, I\.Automated hate speech detection and the problem of offensive language\.*Proceedings of the International AAAI Conference on Web and Social Media*, 11\(1\):512–515, 2017\.doi:10\.1609/icwsm\.v11i1\.14955\.
- DeepSeek\-AI \(2024\)DeepSeek\-AI\.Deepseek\-v3 technical report, 2024\.URL[https://arxiv\.org/abs/2412\.19437](https://arxiv.org/abs/2412.19437)\.
- ElSherief et al\. \(2018\)ElSherief, M\., Kulkarni, V\., Nguyen, D\., Wang, W\. Y\., and Belding, E\.Hate lingo: A target\-based linguistic analysis of hate speech in social media\.*Proceedings of the International AAAI Conference on Web and Social Media*, 12\(1\), June 2018\.doi:10\.1609/icwsm\.v12i1\.15041\.URL[https://ojs\.aaai\.org/index\.php/ICWSM/article/view/15041](https://ojs.aaai.org/index.php/ICWSM/article/view/15041)\.
- Gao & Huang \(2017\)Gao, L\. and Huang, R\.Detecting online hate speech using context aware models\.In Mitkov, R\. and Angelova, G\. \(eds\.\),*Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017*, pp\. 260–266, Varna, Bulgaria, September 2017\. INCOMA Ltd\.doi:10\.26615/978\-954\-452\-049\-6˙036\.URL[https://aclanthology\.org/R17\-1036/](https://aclanthology.org/R17-1036/)\.
- Geng et al\. \(2024\)Geng, J\., Cai, F\., Wang, Y\., Koeppl, H\., Nakov, P\., and Gurevych, I\.A survey of confidence estimation and calibration in large language models\.In Duh, K\., Gomez, H\., and Bethard, S\. \(eds\.\),*Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pp\. 6577–6595, Mexico City, Mexico, June 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.naacl\-long\.366\.URL[https://aclanthology\.org/2024\.naacl\-long\.366/](https://aclanthology.org/2024.naacl-long.366/)\.
- Gilardi et al\. \(2023\)Gilardi, F\., Alizadeh, M\., and Kubli, M\.Chatgpt outperforms crowd workers for text\-annotation tasks\.*Proceedings of the National Academy of Sciences*, 120\(30\):e2305016120, 2023\.doi:10\.1073/pnas\.2305016120\.URL[https://www\.pnas\.org/doi/abs/10\.1073/pnas\.2305016120](https://www.pnas.org/doi/abs/10.1073/pnas.2305016120)\.
- Golchin & Surdeanu \(2024\)Golchin, S\. and Surdeanu, M\.Time travel in LLMs: Tracing data contamination in large language models\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=2Rwq6c3tvr](https://openreview.net/forum?id=2Rwq6c3tvr)\.
- Grattafiori et al\. \(2024\)Grattafiori, A\., Dubey, A\., Jauhri, A\., Pandey, A\., Kadian, A\., Al\-Dahle, A\., et al\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*, 2024\.URL[https://arxiv\.org/abs/2407\.21783](https://arxiv.org/abs/2407.21783)\.
- Han et al\. \(2025\)Han, P\., Kocielnik, R\., Song, P\., Debnath, R\., Mobbs, D\., Anandkumar, A\., and Alvarez, R\. M\.The personality illusion: Revealing dissociation between self\-reports & behavior in LLMs\.*arXiv preprint arXiv:2509\.03730*, 2025\.URL[https://arxiv\.org/abs/2509\.03730](https://arxiv.org/abs/2509.03730)\.
- Jiang et al\. \(2023\)Jiang, A\. Q\., Sablayrolles, A\., Mensch, A\., Bamford, C\., Chaplot, D\. S\., de las Casas, D\., Bressand, F\., Lengyel, G\., Lample, G\., Saulnier, L\., et al\.Mistral 7b\.*arXiv preprint arXiv:2310\.06825*, 2023\.
- Jiang et al\. \(2024\)Jiang, A\. Q\., Sablayrolles, A\., Roux, A\., Mensch, A\., Savary, B\., Bamford, C\., Chaplot, D\. S\., de las Casas, D\., Bou Hanna, E\., Bressand, F\., Lengyel, G\., Bour, G\., Lample, G\., Lavaud, L\. R\., Saulnier, L\., Lachaux, M\.\-A\., Stock, P\., Subramanian, S\., Yang, S\., Antoniak, S\., Le Scao, T\., Gervet, T\., Lavril, T\., Wang, T\., Lacroix, T\., and El Sayed, W\.Mixtral of experts\.*arXiv preprint arXiv:2401\.04088*, 2024\.URL[https://arxiv\.org/abs/2401\.04088](https://arxiv.org/abs/2401.04088)\.
- Khattab et al\. \(2023\)Khattab, O\., Singhvi, A\., Maheshwari, P\., Zhang, Z\., Santhanam, K\., Vardhamanan, S\., Haq, S\., Sharma, A\., Joshi, T\. T\., Moazam, H\., Miller, H\., Zaharia, M\., and Potts, C\.Dspy: Compiling declarative language model calls into self\-improving pipelines\.*ArXiv*, abs/2310\.03714, 2023\.URL[https://api\.semanticscholar\.org/CorpusID:263671701](https://api.semanticscholar.org/CorpusID:263671701)\.
- Kim et al\. \(2024\)Kim, S\., Shin, J\., Cho, Y\., Jang, J\., Longpre, S\., Lee, H\., Yun, S\., Shin, S\., Kim, S\., Thorne, J\., and Seo, M\.Prometheus: Inducing fine\-grained evaluation capability in language models\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=8euJaTveKw](https://openreview.net/forum?id=8euJaTveKw)\.
- Kocielnik et al\. \(2024\)Kocielnik, R\., Li, Z\., Kann, C\., Sambrano, D\., Morrier, J\., Linegar, M\., Taylor, C\., Kim, M\., Naqvie, N\., Soltani, F\., Dehpanah, A\., Cahill, G\., Anandkumar, A\., and Alvarez, R\. M\.Challenges in moderating disruptive player behavior in online competitive action games\.*Frontiers in Computer Science*, 6:1283735, 2024\.doi:10\.3389/fcomp\.2024\.1283735\.URL[https://www\.frontiersin\.org/journals/computer\-science/articles/10\.3389/fcomp\.2024\.1283735/full](https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2024.1283735/full)\.
- Kocielnik et al\. \(2025a\)Kocielnik, R\., Kim, M\., Boonyarungsrit, P\. A\., Soltani, F\., Sambrano, D\., Anandkumar, A\., and Alvarez, R\. M\.Prosocial behavior detection in player game chat: From aligning human\-AI definitions to efficient annotation at scale\.*arXiv preprint arXiv:2508\.05938*, 2025a\.URL[https://arxiv\.org/abs/2508\.05938](https://arxiv.org/abs/2508.05938)\.
- Kocielnik et al\. \(2025b\)Kocielnik, R\., Li, Z\., Linegar, M\., Sambrano, D\., Soltani, F\., Kim, M\., Naqvie, N\., Cahill, G\., Anandkumar, A\., and Alvarez, R\. M\.Online moderation in competitive action games: How intervention affects player behaviors\.*Proc\. ACM Hum\.\-Comput\. Interact\.*, 9\(6\), October 2025b\.doi:10\.1145/3748599\.URL[https://doi\.org/10\.1145/3748599](https://doi.org/10.1145/3748599)\.
- Lin \(2004\)Lin, C\.\-Y\.ROUGE: A package for automatic evaluation of summaries\.In*Text Summarization Branches Out*, pp\. 74–81, Barcelona, Spain, July 2004\. Association for Computational Linguistics\.URL[https://aclanthology\.org/W04\-1013/](https://aclanthology.org/W04-1013/)\.
- Liu et al\. \(2023\)Liu, Y\., Iter, D\., Xu, Y\., Wang, S\., Xu, R\., and Zhu, C\.G\-eval: NLG evaluation using gpt\-4 with better human alignment\.In Bouamor, H\., Pino, J\., and Bali, K\. \(eds\.\),*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pp\. 2511–2522, Singapore, December 2023\. Association for Computational Linguistics\.doi:10\.18653/v1/2023\.emnlp\-main\.153\.URL[https://aclanthology\.org/2023\.emnlp\-main\.153/](https://aclanthology.org/2023.emnlp-main.153/)\.
- Meta AI \(2024\)Meta AI\.Llama 3\.3 model card\.[https://huggingface\.co/meta\-llama/Llama\-3\.3\-70B\-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), December 2024\.
- Miehling et al\. \(2025\)Miehling, E\., Desmond, M\., Natesan Ramamurthy, K\., Daly, E\. M\., Varshney, K\. R\., Farchi, E\., Dognin, P\., Rios, J\., Bouneffouf, D\., Liu, M\., and Sattigeri, P\.Evaluating the prompt steerability of large language models\.In Chiruzzo, L\., Ritter, A\., and Wang, L\. \(eds\.\),*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pp\. 7874–7900, Albuquerque, New Mexico, April 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-189\-6\.doi:10\.18653/v1/2025\.naacl\-long\.400\.URL[https://aclanthology\.org/2025\.naacl\-long\.400/](https://aclanthology.org/2025.naacl-long.400/)\.
- Min et al\. \(2022\)Min, S\., Lyu, X\., Holtzman, A\., Artetxe, M\., Lewis, M\., Hajishirzi, H\., and Zettlemoyer, L\.Rethinking the role of demonstrations: What makes in\-context learning work?In Goldberg, Y\., Kozareva, Z\., and Zhang, Y\. \(eds\.\),*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pp\. 11048–11064, Abu Dhabi, United Arab Emirates, December 2022\. Association for Computational Linguistics\.doi:10\.18653/v1/2022\.emnlp\-main\.759\.URL[https://aclanthology\.org/2022\.emnlp\-main\.759/](https://aclanthology.org/2022.emnlp-main.759/)\.
- Mistral AI \(2025\)Mistral AI\.Mistral\-small\-24b\-instruct\-2501\.[https://huggingface\.co/mistralai/Mistral\-Small\-24B\-Instruct\-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501), 2025\.
- Mohammad et al\. \(2016\)Mohammad, S\., Kiritchenko, S\., Sobhani, P\., Zhu, X\., and Cherry, C\.SemEval\-2016 task 6: Detecting stance in tweets\.In Bethard, S\., Carpuat, M\., Cer, D\., Jurgens, D\., Nakov, P\., and Zesch, T\. \(eds\.\),*Proceedings of the 10th International Workshop on Semantic Evaluation \(SemEval\-2016\)*, pp\. 31–41, San Diego, California, June 2016\. Association for Computational Linguistics\.doi:10\.18653/v1/S16\-1003\.URL[https://aclanthology\.org/S16\-1003/](https://aclanthology.org/S16-1003/)\.
- Nadeem et al\. \(2021\)Nadeem, M\., Bethke, A\., and Reddy, S\.StereoSet: Measuring stereotypical bias in pretrained language models\.In Zong, C\., Xia, F\., Li, W\., and Navigli, R\. \(eds\.\),*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pp\. 5356–5371, Online, August 2021\. Association for Computational Linguistics\.doi:10\.18653/v1/2021\.acl\-long\.416\.URL[https://aclanthology\.org/2021\.acl\-long\.416/](https://aclanthology.org/2021.acl-long.416/)\.
- Nangia et al\. \(2020\)Nangia, N\., Vania, C\., Bhalerao, R\., and Bowman, S\. R\.CrowS\-pairs: A challenge dataset for measuring social biases in masked language models\.In Webber, B\., Cohn, T\., He, Y\., and Liu, Y\. \(eds\.\),*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pp\. 1953–1967, Online, November 2020\. Association for Computational Linguistics\.doi:10\.18653/v1/2020\.emnlp\-main\.154\.URL[https://aclanthology\.org/2020\.emnlp\-main\.154/](https://aclanthology.org/2020.emnlp-main.154/)\.
- Naseem et al\. \(2025\)Naseem, U\., Shiwakoti, S\., Shah, S\. B\., Thapa, S\., and Zhang, Q\.GameTox: A comprehensive dataset and analysis for enhanced toxicity detection in online gaming communities\.In Chiruzzo, L\., Ritter, A\., and Wang, L\. \(eds\.\),*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\)*, pp\. 440–447, Albuquerque, New Mexico, April 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-190\-2\.doi:10\.18653/v1/2025\.naacl\-short\.37\.URL[https://aclanthology\.org/2025\.naacl\-short\.37/](https://aclanthology.org/2025.naacl-short.37/)\.
- OpenAI \(2024\)OpenAI\.GPT\-4o mini: Advancing cost\-efficient intelligence\.OpenAI Blog, 2024\.URL[https://openai\.com/blog/gpt\-4o\-mini\-advancing\-cost\-efficient\-intelligence](https://openai.com/blog/gpt-4o-mini-advancing-cost-efficient-intelligence)\.
- Pang & Lee \(2004\)Pang, B\. and Lee, L\.A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts\.In*Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics \(ACL\-04\)*, pp\. 271–278, Barcelona, Spain, July 2004\.doi:10\.3115/1218955\.1218990\.URL[https://aclanthology\.org/P04\-1035/](https://aclanthology.org/P04-1035/)\.
- Parrish et al\. \(2022\)Parrish, A\., Chen, A\., Nangia, N\., Padmakumar, V\., Phang, J\., Thompson, J\., Htut, P\. M\., and Bowman, S\. R\.BBQ: A hand\-built bias benchmark for question answering\.In Muresan, S\., Nakov, P\., and Villavicencio, A\. \(eds\.\),*Findings of the Association for Computational Linguistics: ACL 2022*, pp\. 2086–2105, Dublin, Ireland, May 2022\. Association for Computational Linguistics\.doi:10\.18653/v1/2022\.findings\-acl\.165\.URL[https://aclanthology\.org/2022\.findings\-acl\.165/](https://aclanthology.org/2022.findings-acl.165/)\.
- Pawitan & Holmes \(2025\)Pawitan, Y\. and Holmes, C\.Confidence in the Reasoning of Large Language Models\.*Harvard Data Science Review*, 7\(1\), January 2025\.https://hdsr\.mitpress\.mit\.edu/pub/jaqt0vpb\.
- Plank \(2022\)Plank, B\.The “problem” of human label variation: On ground truth in data, modeling and evaluation\.In Goldberg, Y\., Kozareva, Z\., and Zhang, Y\. \(eds\.\),*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pp\. 10671–10682, Abu Dhabi, United Arab Emirates, December 2022\. Association for Computational Linguistics\.doi:10\.18653/v1/2022\.emnlp\-main\.731\.URL[https://aclanthology\.org/2022\.emnlp\-main\.731/](https://aclanthology.org/2022.emnlp-main.731/)\.
- Qwen et al\. \(2025\)Qwen, Yang, A\., Yang, B\., Zhang, B\., Hui, B\., Zheng, B\., Yu, B\., Li, C\., Liu, D\., Huang, F\., Wei, H\., Lin, H\., Yang, J\., Tu, J\., Zhang, J\., Yang, J\., Yang, J\., Zhou, J\., Lin, J\., Dang, K\., Lu, K\., Bao, K\., Yang, K\., Yu, L\., Li, M\., Xue, M\., Zhang, P\., Zhu, Q\., Men, R\., Lin, R\., Li, T\., Tang, T\., Xia, T\., Ren, X\., Ren, X\., Fan, Y\., Su, Y\., Zhang, Y\., Wan, Y\., Liu, Y\., Cui, Z\., Zhang, Z\., and Qiu, Z\.Qwen2\.5 Technical Report, January 2025\.URL[http://arxiv\.org/abs/2412\.15115](http://arxiv.org/abs/2412.15115)\.arXiv:2412\.15115 \[cs\]\.
- Santurkar et al\. \(2023\)Santurkar, S\., Durmus, E\., Ladhak, F\., Lee, C\., Liang, P\., and Hashimoto, T\.Whose opinions do language models reflect?In Krause, A\., Brunskill, E\., Cho, K\., Engelhardt, B\., Sabato, S\., and Scarlett, J\. \(eds\.\),*Proceedings of the 40th International Conference on Machine Learning*, volume 202 of*Proceedings of Machine Learning Research*, pp\. 29971–30004\. PMLR, 23–29 Jul 2023\.URL[https://proceedings\.mlr\.press/v202/santurkar23a\.html](https://proceedings.mlr.press/v202/santurkar23a.html)\.
- Sap et al\. \(2022\)Sap, M\., Swayamdipta, S\., Vianna, L\., Zhou, X\., Choi, Y\., and Smith, N\. A\.Annotators with attitudes: How annotator beliefs and identities bias toxic language detection\.In Carpuat, M\., de Marneffe, M\.\-C\., and Meza Ruiz, I\. V\. \(eds\.\),*Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pp\. 5884–5906, Seattle, United States, July 2022\. Association for Computational Linguistics\.doi:10\.18653/v1/2022\.naacl\-main\.431\.URL[https://aclanthology\.org/2022\.naacl\-main\.431/](https://aclanthology.org/2022.naacl-main.431/)\.
- Shaikh et al\. \(2023\)Shaikh, O\., Zhang, H\., Held, W\., Bernstein, M\., and Yang, D\.On second thought, let’s not think step by step\! bias and toxicity in zero\-shot reasoning\.In Rogers, A\., Boyd\-Graber, J\., and Okazaki, N\. \(eds\.\),*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 4454–4470, Toronto, Canada, July 2023\. Association for Computational Linguistics\.doi:10\.18653/v1/2023\.acl\-long\.244\.URL[https://aclanthology\.org/2023\.acl\-long\.244/](https://aclanthology.org/2023.acl-long.244/)\.
- Shi et al\. \(2024\)Shi, W\., Ajith, A\., Xia, M\., Huang, Y\., Liu, D\., Blevins, T\., Chen, D\., and Zettlemoyer, L\.Detecting pretraining data from large language models\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=zWqr3MQuNs](https://openreview.net/forum?id=zWqr3MQuNs)\.
- Tan et al\. \(2025\)Tan, S\., Agrawal, L\. A\., Singhvi, A\., Lai, L\., Ryan, M\. J\., Klein, D\., Khattab, O\., Sen, K\., and Zaharia, M\.LangProBe: a language program benchmark\.In Christodoulopoulos, C\., Chakraborty, T\., Rose, C\., and Peng, V\. \(eds\.\),*Findings of the Association for Computational Linguistics: EMNLP 2025*, pp\. 21489–21509, Suzhou, China, November 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-335\-7\.doi:10\.18653/v1/2025\.findings\-emnlp\.1172\.URL[https://aclanthology\.org/2025\.findings\-emnlp\.1172/](https://aclanthology.org/2025.findings-emnlp.1172/)\.
- Tian et al\. \(2023\)Tian, K\., Mitchell, E\., Zhou, A\., Sharma, A\., Rafailov, R\., Yao, H\., Finn, C\., and Manning, C\. D\.Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine\-tuned with human feedback\.pp\. 5433–5442, December 2023\.doi:10\.18653/v1/2023\.emnlp\-main\.330\.URL[https://aclanthology\.org/2023\.emnlp\-main\.330/](https://aclanthology.org/2023.emnlp-main.330/)\.
- Van Hee et al\. \(2018\)Van Hee, C\., Lefever, E\., and Hoste, V\.SemEval\-2018 task 3: Irony detection in English tweets\.In Apidianaki, M\., Mohammad, S\. M\., May, J\., Shutova, E\., Bethard, S\., and Carpuat, M\. \(eds\.\),*Proceedings of the 12th International Workshop on Semantic Evaluation*, pp\. 39–50, New Orleans, Louisiana, June 2018\. Association for Computational Linguistics\.doi:10\.18653/v1/S18\-1005\.URL[https://aclanthology\.org/S18\-1005/](https://aclanthology.org/S18-1005/)\.
- Weld et al\. \(2021\)Weld, H\., Huang, G\., Lee, J\., Zhang, T\., Wang, K\., Guo, X\., Long, S\., Poon, J\., and Han, C\.CONDA: a CONtextual dual\-annotated dataset for in\-game toxicity understanding and detection\.In Zong, C\., Xia, F\., Li, W\., and Navigli, R\. \(eds\.\),*Findings of the Association for Computational Linguistics: ACL\-IJCNLP 2021*, pp\. 2406–2416, Online, August 2021\. Association for Computational Linguistics\.doi:10\.18653/v1/2021\.findings\-acl\.213\.URL[https://aclanthology\.org/2021\.findings\-acl\.213/](https://aclanthology.org/2021.findings-acl.213/)\.
- Xiong et al\. \(2024\)Xiong, M\., Hu, Z\., Lu, X\., Li, Y\., Fu, J\., He, J\., and Hooi, B\.Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=gjeQKFxFpZ](https://openreview.net/forum?id=gjeQKFxFpZ)\.
- Yang et al\. \(2024\)Yang, D\., Tsai, Y\.\-H\. H\., and Yamada, M\.On verbalized confidence scores for LLMs\.*arXiv preprint arXiv:2412\.14737*, 2024\.URL[https://arxiv\.org/abs/2412\.14737](https://arxiv.org/abs/2412.14737)\.
- Zampieri et al\. \(2019\)Zampieri, M\., Malmasi, S\., Nakov, P\., Rosenthal, S\., Farra, N\., and Kumar, R\.SemEval\-2019 task 6: Identifying and categorizing offensive language in social media \(OffensEval\)\.In May, J\., Shutova, E\., Herbelot, A\., Zhu, X\., Apidianaki, M\., and Mohammad, S\. M\. \(eds\.\),*Proceedings of the 13th International Workshop on Semantic Evaluation*, pp\. 75–86, Minneapolis, Minnesota, USA, June 2019\. Association for Computational Linguistics\.doi:10\.18653/v1/S19\-2010\.URL[https://aclanthology\.org/S19\-2010/](https://aclanthology.org/S19-2010/)\.
- Zhang et al\. \(2026\)Zhang, J\., Yu, S\., Chong, D\., Sicilia, A\., Tomz, M\., Manning, C\. D\., and Shi, W\.Verbalized sampling: How to mitigate mode collapse and unlock LLM diversity\.In*The Fourteenth International Conference on Learning Representations \(ICLR\)*, 2026\.URL[https://openreview\.net/forum?id=9jQkmGunGo](https://openreview.net/forum?id=9jQkmGunGo)\.
- Zhang et al\. \(2020\)Zhang, T\., Kishore, V\., Wu, F\., Weinberger, K\. Q\., and Artzi, Y\.BERTScore: Evaluating text generation with BERT\.In*International Conference on Learning Representations \(ICLR\)*, 2020\.URL[https://openreview\.net/forum?id=SkeHuCVFDr](https://openreview.net/forum?id=SkeHuCVFDr)\.
- Zheng et al\. \(2023\)Zheng, L\., Chiang, W\.\-L\., Sheng, Y\., Zhuang, S\., Wu, Z\., Zhuang, Y\., Lin, Z\., Li, Z\., Li, D\., Xing, E\., Zhang, H\., Gonzalez, J\. E\., and Stoica, I\.Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\.In*Thirty\-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track*, 2023\.URL[https://openreview\.net/forum?id=uccHPGDlao](https://openreview.net/forum?id=uccHPGDlao)\.

## Appendix ADefinition Texts Used

This section provides the exact definition texts given to models, organized by scope \(narrow to broad\)\. Understanding definition scope is critical for interpreting misalignment results\.

### A\.1Narrow Scope Definitions

These definitions require identity\-based targeting; general insults without identity targeting are explicitly excluded\.

#### Twitter Hate Speech\(Davidson et al\.,[2017](https://arxiv.org/html/2606.00467#bib.bib7)\)\.

> Hate Speech:Language that expresses hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group based on attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender\. Not Hate Speech:Content that does not express hatred or target groups based on identity attributes\. Note: Offensive language that contains profanity or insults but does not target identity groups is NOT considered hate speech\.

#### Fox News Hate Speech\.

> Hateful:Language that explicitly or implicitly threatens or demeans a person or a group based upon a facet of their identity such as gender, ethnicity, or sexual orientation\. This includes comments that express prejudice, discrimination, or hostility toward individuals or groups based on their protected characteristics\. Non\-hateful:Comments that do not contain language that threatens or demeans individuals or groups based on their identity characteristics\.

### A\.2Medium Scope Definitions

These definitions include insults and threats regardless of identity targeting, but still require explicit negative content\.

#### GameTox Multi\-Category Definition\.

> Toxic:Any message that falls into one or more of these categories: \(1\) Hate and Harassment: Utterances with identity\-based hate or harassment; \(2\) Threats: Utterances with threats of violence, physical harm, terrorism, or doxing; \(3\) Extremism: Utterances with extremist views or recruitment attempts; \(4\) Insults and Flaming: Insults or attacks on another player \(not identity\-based\); \(5\) Other Offensive Texts: Any message that harms reasonable enjoyment of the game\. Not toxic:Utterances without any toxicity from the above categories\.

#### OLID Offensive Language\(Zampieri et al\.,[2019](https://arxiv.org/html/2606.00467#bib.bib49)\)\.

> Offensive:This post contains offensive language or a targeted \(veiled or direct\) offense\. This includes any form of non\-acceptable language \(profanity\), insults, threats, and posts containing profane language or swear words\. Not Offensive:This post does not contain offense or profanity\. The content may express opinions or criticism but does not include any form of offensive language, insults, or threats\.

### A\.3Broad Scope Definitions

These definitions capture any content that is “rude,” “disrespectful,” or “discouraging” \- very broad criteria\.

#### Gaming Toxicity \(CONDA\)\(Weld et al\.,[2021](https://arxiv.org/html/2606.00467#bib.bib46)\)\.

> Toxic:Utterances containing explicit or implicit toxicity in gaming context\.*Explicit Toxicity:*Direct toxic expressions such as profanity, personal attacks, or insults directed at other players\.*Implicit Toxicity:*Subtler forms including sarcasm, veiled insults, or coded language that requires conversational context to identify as harmful\. Not Toxic:Utterances related to gameplay actions, neutral communication, or content that does not discourage player engagement\. General expressions of frustration not directed at specific players \(e\.g\., “damn\!” or “ugh”\) are not toxic\.

#### General Toxicity \(Perspective API/Jigsaw\)\(cjadams et al\.,[2018](https://arxiv.org/html/2606.00467#bib.bib6)\)\.

> A rude, disrespectful, or unreasonable comment that is likely to make someone leave a discussion\.

Table 9:Definition scope classification\.Narrowdefinitions require identity\-based targeting \(e\.g\., race, religion, gender\) to qualify as toxic, excluding general insults\.Mediumdefinitions include direct insults, threats, and profanity regardless of identity targeting\.Broaddefinitions capture any content that is rude, disrespectful, or discouraging, including implicit and contextual toxicity\. When applied to general toxicity tasks, narrow definitions produce under\-prediction while broad definitions produce over\-prediction\.

### A\.4Non\-Safety Task Definitions

The following definitions were used in the generalization experiments reported in Appendix[F](https://arxiv.org/html/2606.00467#A6)\. The irony and subjectivity definitions were used as the “aligned definition” prompt in the annotation experiments \(Tables[25](https://arxiv.org/html/2606.00467#A6.T25)and[26](https://arxiv.org/html/2606.00467#A6.T26)\); the five stance definitions were used as the elicitation context for DSF on stance detection\. All definitions operationalize the target concept following the framing of the original task papers and do not fit the narrow/medium/broad scope schema of Table[9](https://arxiv.org/html/2606.00467#A1.T9), which applies only to the toxicity datasets\.

#### Irony Detection \(SemEval\-2018 Task 3\)\(Van Hee et al\.,[2018](https://arxiv.org/html/2606.00467#bib.bib45)\)\.

> Ironic:A statement where the intended meaning is opposite to or incongruent with the literal meaning, often used to express humor, criticism, or sarcasm\. Irony involves a discrepancy between what is said and what is meant, or between expectations and reality\. Not Ironic:A statement where the intended meaning matches the literal meaning, with no incongruence between expression and intent\.

#### Subjectivity Detection\(Pang & Lee,[2004](https://arxiv.org/html/2606.00467#bib.bib34)\)\.

> Subjective:A sentence that expresses personal opinions, evaluations, emotions, or speculations\. Subjective sentences reflect the author’s point of view, feelings, or interpretations rather than objective facts\. Objective:A sentence that presents factual information that can be verified independently of the author’s personal opinion or emotional state\.

#### Stance Detection \(SemEval\-2016 Task 6\)\(Mohammad et al\.,[2016](https://arxiv.org/html/2606.00467#bib.bib29)\)\.

The stance task uses five targets, each with its own definition\. Unlike the other tasks in our study, stance is framed as a choice between two explicit alternatives \(*in favor*/*against*\) rather than as classification against a single concept definition, as discussed in Appendix[F](https://arxiv.org/html/2606.00467#A6)\.

> Stance toward Legalization of Abortion\.*In favor:*The tweet expresses support for or agreement with the legalization of abortion, including women’s right to choose\.*Against:*The tweet expresses opposition to the legalization of abortion, including pro\-life positions or moral objections\.

> Stance toward Atheism\.*In favor:*The tweet expresses support for, agreement with, or positive views toward atheism or non\-belief in God\.*Against:*The tweet expresses opposition to, criticism of, or negative views toward atheism\.

> Stance toward Climate Change\.*In favor:*The tweet expresses belief in or concern about climate change as a real and serious issue requiring action\.*Against:*The tweet expresses skepticism about, denial of, or opposition to claims about climate change\.

> Stance toward Feminist Movement\.*In favor:*The tweet expresses support for or agreement with the feminist movement and gender equality\.*Against:*The tweet expresses opposition to, criticism of, or negative views toward the feminist movement\.

> Stance toward Hillary Clinton\.*In favor:*The tweet expresses support for, agreement with, or positive views toward Hillary Clinton\.*Against:*The tweet expresses opposition to, criticism of, or negative views toward Hillary Clinton\.

## Appendix BPrompts

All prompts follow a consistent template structure\. The base classification prompt presents the task, followed by the text to classify, and ends with an “Answer:” suffix\. The confidence elicitation suffix \(Figure[2](https://arxiv.org/html/2606.00467#S3.F2)\) is appended to all classification prompts\.

#### Experimental Conditions\.

Table[10](https://arxiv.org/html/2606.00467#A2.T10)summarizes the 10 manually\-specified prompting conditions together with the 2 DSPy automated optimization conditions used in our experiments\.

Table 10:Description of experimental conditions\.
### B\.1DSPy Automated Optimization

We additionally evaluate two automated prompt optimization conditions using DSPy\(Khattab et al\.,[2023](https://arxiv.org/html/2606.00467#bib.bib18)\), which selects few\-shot demonstrations from a labeled validation pool \(of 100 examples\) to maximize accuracy\.DSPy Optimizeduses a bare classification signature with no human\-written definition \(automated counterpart tofew\_shot\)\.DSPy Alignedadditionally includes the dataset’s aligned definition in the instruction field \(automated counterpart tofew\_shot\_aligned\_def\)\. Optimization is run independently per model–dataset pair\.

### B\.2Continuation Prompt

To measure text\-level familiarity \(ROUGE\-L\), we split each text instance at the 40% mark by word count\. We prompt the model with the first 40% \(prefix\) using the following template:

> Continue the following passage faithfully: \[PREFIX\]

The model generates a continuation using greedy decoding \(temperature 0\.0\)\. We then compute the ROUGE\-L F1 score between the generated continuation and the remaining 60% of the ground truth text \(suffix\)\.

#### Additional memorization metrics\.

To confirm that ROUGE\-L’s limitations as a memorization proxy do not drive our findings, we compute two semantically rich alternatives over the same continuation/suffix pairs\. Both use theall\-MiniLM\-L6\-v2encoder—the same encoder as DSF—so any sign difference between memorization and DSF reflects signal type rather than encoder choice\.Embedding Similarityis the cosine similarity between the sentence embeddings of the generated continuation and the ground\-truth suffix\.BERTScore\(Zhang et al\.,[2020](https://arxiv.org/html/2606.00467#bib.bib51)\)computes token\-level F1 via pairwise cosine similarities of contextual token embeddings of the two texts\. Both metrics are averaged at the model–dataset level before correlation analysis\.

## Appendix CMixed Effects Regression Models

We fit four mixed effects logistic regression models to analyze annotation correctness and error correction\. All models include message\-level random intercepts to account for repeated measures across experimental conditions\. Models were estimated using Bayesian variational inference \(BinomialBayesMixedGLM in statsmodels\)\.

#### Model 1: Domain\-Alignment Effects

Research Question:What factors predict correct annotation?

Formula:correct∼\\simC\(model\) \+ C\(condition\) \+ C\(domain\) \+ is\_aligned\_defn \+ C\(domain\):is\_aligned\_defn \+ \(1\|message\)

Reference categories:Llama\-3\.1\-8B \(model\), Forum \(domain\), Misaligned definition \(condition\)

Table 11:Model 1: Full results for domain\-alignment effects\.PredictorCoefficientOR \[95% CI\]Intercept1\.5491\.5494\.714\.71\[4\.63, 4\.78\]Model Effects \(vs\. Llama\-8B\):DeepSeek\-V30\.4990\.4991\.651\.65\[1\.58, 1\.72\]Llama\-3\.1\-70B0\.2370\.2371\.271\.27\[1\.22, 1\.32\]Mistral\-7B0\.3840\.3841\.471\.47\[1\.42, 1\.52\]Mistral\-Small\-24B0\.3260\.3261\.391\.39\[1\.33, 1\.44\]Mixtral\-8x7B0\.4560\.4561\.581\.58\[1\.52, 1\.64\]Domain Effects \(vs\. Forum\):Gaming2\.6282\.62813\.8513\.85\[13\.31, 14\.41\]News−0\.048\-0\.0480\.950\.95\[0\.92, 0\.99\]Social Media1\.6951\.6955\.455\.45\[5\.27, 5\.63\]Alignment Effects:is\_aligned\_defn0\.0730\.0731\.081\.08\[1\.04, 1\.11\]Gaming×\\timesAligned0\.4190\.4191\.521\.52\[1\.40, 1\.65\]News×\\timesAligned0\.6450\.6451\.911\.91\[1\.80, 2\.02\]Social Media×\\timesAligned0\.3330\.3331\.401\.40\[1\.32, 1\.47\]Random Effects SD1\.1561\.1563\.183\.18\[3\.11, 3\.24\]
#### Model 2: Model\-Alignment Interaction

Research Question:Do different models benefit differently from aligned definitions?

Formula:correct∼\\simC\(model\) \+ C\(condition\) \+ C\(domain\) \+ is\_aligned\_defn \+ C\(model\):is\_aligned\_defn \+ \(1\|message\)

Table 12:Model 2: Model×\\timesAlignment interactions\.Llama\-3\.1\-70B benefits most from alignment \(OR=1\.55\), despite having the lowest rescue rate\. This suggests larger models have stronger priors that are harder to override, but can better utilize guidance when provided\.

#### Model 3: Zero\-Shot Predictor

Research Question:Does zero\-shot correctness predict prompted performance?

Formula:correct∼\\simC\(model\) \+ C\(condition\) \+ C\(domain\) \+ zero\_shot\_correct \+ \(1\|message\)

Table 13:Model 3: Zero\-shot as predictor of prompted correctness\.
#### Model 4: Rescue Prediction

Research Question:Among zero\-shot errors, what predicts whether prompting will correct the error?

Formula:rescued∼\\simC\(model\) \+ C\(condition\) \+ C\(domain\) \+ zs\_conf\_std \+ \(1\|message\)

This model is fit only on cases where the model was wrong in zero\-shot \(N=47,672 errors\)\.

Table 14:Model 4: Predicting error correction \(rescue\) among zero\-shot errors\.PredictorCoefficientOR \[95% CI\]ZS Confidence \(standardized\)−0\.171\-0\.1710\.840\.84\[0\.82, 0\.87\]Model Effects \(vs\. Llama\-8B\):DeepSeek\-V30\.6340\.6341\.891\.89\[1\.75, 2\.03\]Llama\-3\.1\-70B0\.1110\.1111\.121\.12\[1\.04, 1\.20\]Mistral\-7B−0\.071\-0\.0710\.930\.93\[0\.87, 1\.00\]Mistral\-Small\-24B0\.5140\.5141\.671\.67\[1\.57, 1\.79\]Mixtral\-8x7B0\.2960\.2961\.351\.35\[1\.25, 1\.45\]Domain Effects \(vs\. Forum\):Gaming0\.9820\.9822\.672\.67\[2\.47, 2\.89\]News0\.5800\.5801\.791\.79\[1\.68, 1\.89\]Social Media1\.3311\.3313\.783\.78\[3\.56, 4\.02\]

## Appendix DAdditional Experimental Results

Table 15:Detailed breakdown of model accuracy \(%\) under specific misaligned definitions\. These values are averaged into the single “Misaligned” column in Table[3](https://arxiv.org/html/2606.00467#S4.T3)\. Columns are ordered by definition scope \(Narrow→\\toBroad\)\. Models below the horizontal rule are not included in regression analyses; Cond\. AUC is for the six open\-weights models\.Table 16:Accuracy \(%\) by Dataset and Misaligned Condition\. Bold = best for that dataset\. “N/A” indicates the aligned \(non\-misaligned\) condition\.Table 17:Model sensitivity: best accuracy \(across all conditions\) minus worst accuracy under misaligned definitions\. Models below the horizontal rule are not included in regression analyses\.Table 18:Corruption rate \(%\) by model and misaligned condition\. High corruption indicates that correct predictions are frequently flipped to incorrect ones when the definition changes\. Models below the horizontal rule are not included in regression analyses\.Table 19:Rescue rate by model on the Jigsaw Unintended Bias dataset\(Borkan et al\.,[2019](https://arxiv.org/html/2606.00467#bib.bib2)\), which was designed to reduce identity\-group annotation bias\. Rescue rate isP\(correct∣prompted,zero\-shot wrong\)P\(\\text\{correct\}\\mid\\text\{prompted\},\\text\{zero\-shot wrong\}\), computed over all non\-zero\-shot conditions \(aligned definition, few\-shot, few\-shot\+definition, DSPy Optimized, DSPy Aligned, and five misaligned conditions\), matching the methodology of Table[6](https://arxiv.org/html/2606.00467#S4.T6)\. All models remain well below 50%, confirming that decision stickiness is not an artifact of Jigsaw’s known label\-bias limitations\.### D\.1Multi\-Turn Rescue Experiment

A natural follow\-up to the one\-shot rescue analysis \(Sec\.[4\.2](https://arxiv.org/html/2606.00467#S4.SS2)\) is whether iterative, history\-aware prompting can break decision stickiness\. For each model–dataset pair we sampled 100 zero\-shot errors and applied a 3\-step rescue sequence: Turn 1 adds a few labeled examples, Turn 2 adds the aligned task definition, and Turn 3 asks the model to reconsider\. At each turn the model is shown its previous answers and confidence scores, so later prompts see the full history\.

Table[20](https://arxiv.org/html/2606.00467#A4.T20)reports cumulative rescue rates by model, both overall and restricted to high\-confidence errors\. Iterative prompting helps modestly — particularly when the aligned definition is added at Turn 2 — but rescue plateaus well below the one\-shot rate of 34\.8% reported in Table[6](https://arxiv.org/html/2606.00467#S4.T6), and high\-confidence errors remain almost entirely uncorrected \(8\.5% rescued after three turns\)\.

Table 20:Cumulative rescue rates over three multi\-turn correction attempts\. “High\-conf” columns restrict to zero\-shot errors with confidence\>0\.9\>0\.9\.†78\.8% of sampled items dropped due to classification refusals\.

## Appendix EFamiliarity Analysis Details

### E\.1DSF Robustness Across Embedding Models

The DSF score is a cosine similarity in an embedding space, so its value depends on the encoder used\. To rule out that our results are an artifact of choosing any single embedding model, we recompute DSF using six diverse sentence\-embedding models spanning small \(MiniLM, mpnet\) to large \(bge\-large, e5\-large, instructor\-large\) open\-source encoders, plus a proprietary OpenAI model \(text\-embedding\-3\-small\)\. The full model identifiers are:

- •MiniLM:sentence\-transformers/all\-MiniLM\-L6\-v2
- •MPNet:sentence\-transformers/all\-mpnet\-base\-v2
- •BGE\-large:BAAI/bge\-large\-en\-v1\.5
- •E5\-large:intfloat/e5\-large\-v2
- •Instructor\-large:hkunlp/instructor\-large
- •OpenAI:text\-embedding\-3\-small

We then define*consensus DSF*for each model–dataset cell as the unweighted mean DSF across the six encoders, and adopt it as the primary DSF metric in the main text\.

The DSF analysis covers all nine evaluated models across six datasets \(the five primary datasets plus Jigsaw Unintended Bias\(Borkan et al\.,[2019](https://arxiv.org/html/2606.00467#bib.bib2)\)\), yieldingN=54N=54model–dataset pairs\. Model performance is reported in Table[3](https://arxiv.org/html/2606.00467#S4.T3)\.

#### Per\-embedding partial correlations\.

Table[21](https://arxiv.org/html/2606.00467#A5.T21)reports the partial correlation between DSF and zero\-shot accuracy \(controlling for dataset\) separately for each embedding choice\. All six embeddings yield a positive partial correlation, ranging from\+0\.30\+0\.30to\+0\.49\+0\.49\. Larger encoders \(bge\-large, e5\-large, OpenAI\) tend to yield slightly stronger correlations than the compact MiniLM\. The consensus row \(\+0\.41\+0\.41\) sits at the upper end of this range, indicating that averaging across encoders denoises rather than dilutes the signal\.

Table 21:DSF–accuracy correlations across six sentence\-embedding models on the extendedN=54N=54panel \(zero\-shot accuracy, partial correlation controls for dataset\)\. Consensus DSF is the per\-cell mean across embeddings, and is the metric reported in the main text\.
#### Pairwise embedding agreement\.

DSF values are highly concordant across embedding choices\. Table[22](https://arxiv.org/html/2606.00467#A5.T22)reports the Pearson correlation between DSF score vectors produced by different embedding models on theN=54N=54panel; all pairwise correlations exceed0\.870\.87, and most exceed0\.950\.95\. This indicates that the variation in Table[21](https://arxiv.org/html/2606.00467#A5.T21)reflects what each embedding emphasizes \(e\.g\., lexical vs\. semantic, general vs\. domain\-tuned\) rather than disagreement about which model–dataset pairs are high\- vs\. low\-DSF\.

Table 22:Pairwise Pearson correlations between DSF score vectors across six embedding models, on theN=54N=54panel\. All embeddings agree closely on the relative ordering of model–dataset cells\.
#### Takeaway\.

The DSF–accuracy association is not an artifact of a particular encoder\. Across six embedding models the partial correlation remains positive and meaningful, and the embeddings closely agree on which model–dataset pairs score high or low\. We therefore report consensus DSF in the main text as a stable, encoder\-agnostic operationalization of the metric\.

### E\.2Min\-K% Prob as a Memorization Baseline

Min\-K% Prob\(Shi et al\.,[2024](https://arxiv.org/html/2606.00467#bib.bib42)\)computes the average log probability of thek%k\\%least likely tokens in a text\. Lower \(more negative\) values indicate the model finds certain tokens surprising, suggesting the text was*not*memorized during training\. Computing Min\-K% requires per\-token log probabilities, which are unavailable for most frontier API models in our evaluation\. We therefore use ROUGE\-L as the primary memorization baseline, as it is computable via text generation for any model\.

To verify that this choice does not change our qualitative conclusions, we applied Min\-K% \(k=20%k=20\\%\) to the two open\-weights models in our panel for which log probabilities are available \(Llama\-3\.1\-8B and Mistral\-7B, base versions\) across the five paper datasets \(N=10N=10model–dataset pairs\)\. Table[23](https://arxiv.org/html/2606.00467#A5.T23)reports mixed\-effects model coefficients \(Accuracy∼\\simmetric \+\(1∣dataset\)\(1\\mid\\text\{dataset\}\)\+\(1∣model\)\(1\\mid\\text\{model\}\)\) for ROUGE\-L, Min\-K%, and DSF on this subset\. Min\-K% and ROUGE\-L are directionally consistent—both show negative standardized coefficients while DSF remains positive—confirming that our conclusions do not depend on the choice of memorization metric\.

Table 23:Mixed\-effects model coefficients for memorization metrics and DSF on theN=10N=10subset \(Llama\-3\.1\-8B and Mistral\-7B×\\times5 paper datasets\) where Min\-K% Prob is computable\. Standardized coefficients allow cross\-metric comparison of effect size\.
### E\.3Semantic Memorization Metrics: BERTScore and Embedding Similarity

ROUGE\-L captures lexical overlap via the longest common subsequence and may therefore underestimate memorization when a model reproduces the*content*of a passage with paraphrased surface form\. We additionally evaluate two semantically richer alternatives that operate on the same generated continuations as ROUGE\-L, so the comparison isolates the scoring function rather than the generation procedure\.

#### BERTScore\.

BERTScore\(Zhang et al\.,[2020](https://arxiv.org/html/2606.00467#bib.bib51)\)replacesnn\-gram matching with contextual token\-embedding similarity\. Each token in the generated continuationy^\\hat\{y\}and the reference suffixyyis encoded with a pretrained contextual encoder, and tokens are greedily aligned by maximum cosine similarity:

RBERT=1\|y\|∑yi∈ymaxy^j∈y^⁡𝐲i⊤𝐲^j,PBERT=1\|y^\|∑y^j∈y^maxyi∈y⁡𝐲i⊤𝐲^j,R\_\{\\text\{BERT\}\}=\\frac\{1\}\{\|y\|\}\\sum\_\{y\_\{i\}\\in y\}\\max\_\{\\hat\{y\}\_\{j\}\\in\\hat\{y\}\}\\mathbf\{y\}\_\{i\}^\{\\top\}\\hat\{\\mathbf\{y\}\}\_\{j\},\\quad P\_\{\\text\{BERT\}\}=\\frac\{1\}\{\|\\hat\{y\}\|\}\\sum\_\{\\hat\{y\}\_\{j\}\\in\\hat\{y\}\}\\max\_\{y\_\{i\}\\in y\}\\mathbf\{y\}\_\{i\}^\{\\top\}\\hat\{\\mathbf\{y\}\}\_\{j\},withF1,BERTF\_\{1,\\text\{BERT\}\}as their harmonic mean\. Because the embeddings are contextual, BERTScore credits paraphrases and near\-synonyms that ROUGE\-L would penalize\. We usemicrosoft/deberta\-xlarge\-mnlias the encoder \(the default recommended byZhang et al\. \([2020](https://arxiv.org/html/2606.00467#bib.bib51)\)for English\) and reportF1,BERTF\_\{1,\\text\{BERT\}\}as the primary score\.

#### Embedding cosine similarity\.

Embedding similarity moves one level higher, comparing whole\-sentence representations rather than token alignments:

EmbSim\(y^,y\)=ϕ\(y^\)⊤ϕ\(y\)‖ϕ\(y^\)‖‖ϕ\(y\)‖,\\text\{EmbSim\}\(\\hat\{y\},y\)\\;=\\;\\frac\{\\phi\(\\hat\{y\}\)^\{\\top\}\\phi\(y\)\}\{\\\|\\phi\(\\hat\{y\}\)\\\|\\,\\\|\\phi\(y\)\\\|\},whereϕ\(⋅\)\\phi\(\\cdot\)is a sentence encoder\. This is the loosest of the three notions of textual overlap: a continuation that captures the gist of the reference but rewords it freely can still score high\. To ensure that any difference between EmbSim and DSF reflects*what*is being measured \(continuation–reference alignment vs\. definition–self\-description alignment\) rather than*which encoder*does the measuring, we use the sameall\-MiniLM\-L6\-v2encoder as in DSF\.

#### Spectrum of memorization sensitivity\.

Taken together, ROUGE\-L, BERTScore, and EmbSim form a progression from purely lexical to fully semantic notions of reproduction\. If text memorization were driving annotation accuracy and ROUGE\-L were simply too crude to detect it, we would expect the correlation with accuracy to strengthen \(or at least become positive\) as we relax the lexical constraint\. Table[4](https://arxiv.org/html/2606.00467#S4.T4)shows the opposite: all three text\-memorization metrics yield negative partial correlations with accuracy after controlling for dataset, while DSF remains positive\. The negative direction is preserved across the full lexical–semantic spectrum, indicating that the issue is not ROUGE\-L’s lexical sensitivity but that text reproduction itself, however measured, does not predict annotation performance\.

### E\.4Definition\-Specific Familiarity \(DSF\) Variants

In the main text, we defineDefinition\-Specific Familiarity \(DSF\)as the cosine similarity between the embedding of the dataset’s ground\-truth definition and the model’s own description of the concept\. Below we detail this primary metric alongside two alternative variants we explored\.

#### DSF \(Primary Metric\)\.

This is the metric reported in the main results\. We prompt the model with a neutral instruction to describe the concept:“You are documenting your internal understanding of a labeling concept\. Concept: \[Concept Name\]\. Describe, in your own words, what content qualifies as this concept\.”We then compute the cosine similarity between the embedding of this generated response and the embedding of the dataset’s full definition\. We found this approach correlated most strongly with model performance\.

#### Alternative Variant: Positive DSF\.

This variant isolates the “positive” class definition \(e\.g\., whatistoxicity\) from the “negative” class\. It usesmulti\-prompt elicitationto reduce variance, averaging similarities across three prompt styles:Direct\(“You are documenting…”\),Expert\(“As an expert annotator…”\), andExamples\(“Without examples, describe…”\)\.

#### Alternative Variant: Enhanced DSF\.

This metric extends Positive DSF by addingdomain contextualization\(e\.g\.,“in online gaming chat”for GameTox\) andnegative class alignment\(comparing the model’s description of what the concept isnotto the dataset’s negative definition\)\. The final score is the average of the positive and negative alignment scores\.

Table 24:Familiarity metrics by dataset \(averaged across the six open\-weights models, zero\-shot\)\. DSF is reported as consensus DSF across six sentence encoders\. Jigsaw has the lowest DSF and is the hardest dataset; GameTox has the highest accuracy with negligible text memorization, illustrating that conceptual alignment, not memorization, tracks performance\.

## Appendix FGeneralization Beyond Toxicity

To test whether our two main findings—the positive DSF–accuracy association \(RQ1\) and decision stickiness \(RQ2\)—generalize beyond safety domains, we replicate the core analyses on two non\-safety binary classification tasks:*irony detection*\(SemEval\-2018 Task 3;Van Hee et al\.[2018](https://arxiv.org/html/2606.00467#bib.bib45)\) and*subjectivity detection*\(Pang & Lee,[2004](https://arxiv.org/html/2606.00467#bib.bib34)\)\. Both tasks use explicit human\-authored definitions and binary labels, directly analogous to our toxicity setup\. We evaluate the same 9 models from Table[2](https://arxiv.org/html/2606.00467#S3.T2), sampling 1,000 instances per dataset–model pair as in the main experiments\.

#### DSF replicates on non\-safety tasks\.

DSF shows a consistent positive association with zero\-shot accuracy across both non\-safety datasets \(partialr=\+0\.343r=\+0\.343,N=18N=18model–dataset pairs\), directionally replicating our toxicity finding \(partialr=\+0\.34r=\+0\.34on the original five datasets, see Section[4\.1](https://arxiv.org/html/2606.00467#S4.SS1)\)\. The effect is therefore not safety\-specific\.

#### Decision stickiness replicates on non\-safety tasks\.

The overall one\-shot few\-shot rescue rate is45\.0%45\.0\\%\(95%95\\%CI\[43\.2%,46\.8%\]\[43\.2\\%,46\.8\\%\]\) across3,0523\{,\}052zero\-shot errors, comparable in magnitude to the34\.8%34\.8\\%reported for toxicity \(Table[6](https://arxiv.org/html/2606.00467#S4.T6)\)\. More than half of zero\-shot errors remain uncorrected on these non\-safety tasks as well\. Table[25](https://arxiv.org/html/2606.00467#A6.T25)summarizes accuracy by condition \(averaged across the two datasets\); Table[26](https://arxiv.org/html/2606.00467#A6.T26)reports the per\-model rescue rates\.

Table 25:Accuracy \(%\) by condition on the two non\-safety datasets \(irony and subjectivity\), averaged across the two datasets\. Sampling and inference settings match the main experiments\.Table 26:Rescue rate by model in the few\-shot condition on the two non\-safety datasets \(irony and subjectivity\)\. Rescue rate isP\(correct∣few\-shot,zero\-shot wrong\)P\(\\text\{correct\}\\mid\\text\{few\-shot\},\\text\{zero\-shot wrong\}\), computed pooled across the two datasets\.
#### Stance detection: a structurally different task\.

As a further check, we ran a preliminary experiment on stance detection \(SemEval\-2016 Task 6;Mohammad et al\.[2016](https://arxiv.org/html/2606.00467#bib.bib29)\) across 5 targets and 8 models\. The DSF direction remains positive \(partialr=\+0\.295r=\+0\.295,N=40N=40target–model pairs\)\. Stance detection differs structurally from our other tasks, however: it requires choosing between two explicit alternatives \(“in favor” / “against”\) rather than classifying content against a single concept definition, which may introduce additional sources of variation beyond conceptual alignment around the definition itself\. We therefore treat stance as directionally consistent but not as clean a test of the DSF–accuracy relationship as irony and subjectivity, and we report it here only as supporting evidence\.

#### Takeaway\.

Both core findings replicate outside the toxicity domain: DSF remains positively associated with zero\-shot accuracy, and most zero\-shot errors continue to resist correction even when models are given aligned definitions or few\-shot examples\. The effects we identify in the main paper appear to be properties of definition\-driven annotation under internalized priors rather than artifacts of the safety setting\.

## Appendix GModel and Dataset Use Across Experiments

Our experiments use a two\-tier structure for both models and datasets, summarized in Table[27](https://arxiv.org/html/2606.00467#A7.T27)\.

The main\-paper descriptive RQ2/RQ3 result tables and figures \(Tables[6](https://arxiv.org/html/2606.00467#S4.T6),[7](https://arxiv.org/html/2606.00467#S4.T7),[8](https://arxiv.org/html/2606.00467#S4.T8)and Figs\.[3](https://arxiv.org/html/2606.00467#S4.F3),[4](https://arxiv.org/html/2606.00467#S4.F4)\) report values aggregated across all nine evaluated models \(Llama\-3\.1\-8B, Llama\-3\.1\-70B, Llama\-3\.3\-70B, Mistral\-7B, Mistral\-Small\-24B, Mixtral\-8x7B, DeepSeek\-V3, GPT\-4o\-mini, Qwen\-2\.5\-72B\)\. All mixed\-effects regression analyses \(Table[5](https://arxiv.org/html/2606.00467#S4.T5)in the main paper and Tables[11](https://arxiv.org/html/2606.00467#A3.T11)–[14](https://arxiv.org/html/2606.00467#A3.T14)in the appendix\) and the multi\-turn rescue analysis \(Table[20](https://arxiv.org/html/2606.00467#A4.T20)\) are scoped to the original six open\-weights models \(Llama\-3\.1\-8B, Llama\-3\.1\-70B, Mistral\-7B, Mistral\-Small\-24B, Mixtral\-8x7B, DeepSeek\-V3\) to keep statistical inference tied to the original six\-model design; GPT\-4o\-mini, Llama\-3\.3\-70B, and Qwen\-2\.5\-72B contribute to the descriptive and performance tables but are excluded from the mixed\-effects regressions\.

The dataset side follows the same pattern\. The five primary toxicity datasets \(Twitter Hate, OLID, GameTox, Fox News, Jigsaw Toxic Comments\) are the analytic core and appear in every result table\. The three supplementary datasets — Jigsaw Unintended Bias, SemEval Irony, Subjectivity — are used only in the specific analyses indicated below: Jigsaw Unintended Bias as a label\-bias robustness check for the rescue and familiarity results, irony and subjectivity as out\-of\-domain replications of RQ1 and RQ2 \(Appendix[F](https://arxiv.org/html/2606.00467#A6)\)\.

Table 27:Coverage patterns used across the paper\. “9 models” = all nine evaluated models \(Llama\-3\.1\-8B, Llama\-3\.1\-70B, Llama\-3\.3\-70B, Mistral\-7B, Mistral\-Small\-24B, Mixtral\-8x7B, DeepSeek\-V3, GPT\-4o\-mini, Qwen\-2\.5\-72B\)\. “6 core” = the original six open\-weights subset \(Llama\-3\.1\-8B, Llama\-3\.1\-70B, Mistral\-7B, Mistral\-Small\-24B, Mixtral\-8x7B, DeepSeek\-V3\)\. “5 main” = the five primary toxicity datasets; Jigsaw UB = Jigsaw Unintended Bias\.
On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Similar Articles

Can LLMs Take Retrieved Information with a Grain of Salt?

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

LLMs for automatic annotation of Mandarin narrative transcripts

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Learning, Fast and Slow: Towards LLMs That Adapt Continually

Submit Feedback

Similar Articles

Can LLMs Take Retrieved Information with a Grain of Salt?
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
LLMs for automatic annotation of Mandarin narrative transcripts
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Learning, Fast and Slow: Towards LLMs That Adapt Continually