The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction

arXiv cs.CL Papers

Summary

The Ghost Annotator framework combines conformal prediction with collaborative filtering to model LLM behavior and human label variation in content moderation, revealing structural demographic biases in larger models.

arXiv:2606.02911v1 Announce Type: new Abstract: Current research primarily focuses on model performance, while comparatively less attention has been devoted to uncertainty estimation, particularly in settings where LLMs are increasingly used to generate annotated data. We introduce a framework combining conformal prediction with a Collaborative Filtering-style representation of annotators to model LLM behavior in relation to human annotators and to analyze patterns of agreement and disagreement. Using Non-Conformity Scores, we introduce the Ghost Prediction metric and the Ghost Annotator representation to quantify cases in which model predictions diverge from all available human annotations. We compute cosine similarity measures to explore differences in model behavior across sociodemographic axes. We evaluated four LLMs of different sizes and families across four content moderation datasets. Our findings show that, while the uncertainty of all models increases with annotator disagreement, larger models tend to be more confident in the classification of texts that are not aligned with any human annotation. Finally, the Ghost Annotator framework reveals a consistent and robust pattern of demographic misalignment, suggesting a structural bias likely rooted in pretraining corpora.

Cached at: 06/03/26, 09:35 AM

# The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction
Source: [https://arxiv.org/html/2606.02911](https://arxiv.org/html/2606.02911)
Mirko Lai (2, 3), Alessandra Urbinati (1), Simona Frenda (4, 3), Fabiana Vernero (5), Marco Antonio Stranisci (5, 3)

(1) Laboratory for the Modeling of Biological and Socio-technical Systems, Northeastern University, Boston, MA, USA; (2) Heriot-Watt University, Edinburgh, Scotland; (3) aequa-tech, Torino, Italy; (4) Università del Piemonte Orientale, Vercelli, Italy; (5) Università degli Studi di Torino, Torino, Italy. Correspondence: marcoantonio.stranisci@unito.it

###### Abstract

Current research primarily focuses on model performance, while comparatively less attention has been devoted to uncertainty estimation, particularly in settings where LLMs are increasingly used to generate annotated data. We introduce a framework combining conformal prediction with a Collaborative Filtering-style representation of annotators to model LLM behavior in relation to human annotators and to analyze patterns of agreement and disagreement. Using Non-Conformity Scores, we introduce the Ghost Prediction metric and the Ghost Annotator representation to quantify cases in which model predictions diverge from all available human annotations. We compute cosine similarity measures to explore differences in model behavior across sociodemographic axes. We evaluated four LLMs of different sizes and families across four content moderation datasets. Our findings show that, while the uncertainty of all models increases with annotator disagreement, larger models tend to be more confident in the classification of texts that are not aligned with any human annotation. Finally, the Ghost Annotator framework reveals a consistent and robust pattern of demographic misalignment, suggesting a structural bias likely rooted in pretraining corpora.


## 1 Introduction

Human Label Variation (HLV) Plank ([2022](https://arxiv.org/html/2606.02911#bib.bib51)) recently emerged as a research paradigm aimed at enhancing the fairness and inclusivity of language technologies and resources. Moving beyond traditional approaches based on label aggregation, HLV motivates a shift toward datasets and models Uma et al. ([2021](https://arxiv.org/html/2606.02911#bib.bib82)); Cabitza et al. ([2023](https://arxiv.org/html/2606.02911#bib.bib81)) that aim to capture different perspectives, especially on highly subjective phenomena Frenda et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib21)). This shift has important theoretical and practical implications, as biased technologies can lead to systematic harms for specific groups in downstream tasks such as automatic content moderation Kocoń et al. ([2021a](https://arxiv.org/html/2606.02911#bib.bib83)); Sap et al. ([2022b](https://arxiv.org/html/2606.02911#bib.bib40)); Anand et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib52)).

The Natural Language Processing (NLP) community addresses HLV from a wide range of perspectives, from the development of disaggregated data with annotators’ metadata (Sachdeva et al., [2022](https://arxiv.org/html/2606.02911#bib.bib53); Mostafazadeh Davani et al., [2024b](https://arxiv.org/html/2606.02911#bib.bib37)), to methods for modeling and capturing diverse worldviews Wich et al. ([2021](https://arxiv.org/html/2606.02911#bib.bib50)); Van Der Meer et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib72)) to better represent minoritized groups Vitsakis et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib68)). However, open challenges remain, the most relevant being: i. the generalization of findings on HLV is hindered by the mismatch between datasets and their annotation schemes Fortuna and Nunes ([2018](https://arxiv.org/html/2606.02911#bib.bib86)); Vidgen and Derczynski ([2020](https://arxiv.org/html/2606.02911#bib.bib85)); ii. much of the existing research focuses on model performance, with comparatively less attention to uncertainty, which is becoming central with the growing adoption of Large Language Models (LLMs) to generate annotated datasets Tan et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib54)).

Our work tackles these challenges by presenting the Ghost Annotator, a representation of LLM behavior derived from conformal prediction scores, used to analyze similarity with human annotators. Our approach builds on Conformal Prediction (Chen et al., [2023](https://arxiv.org/html/2606.02911#bib.bib46)), a methodology for models’ uncertainty estimation, to profile groups of annotators and identify which annotator groups the model is most similar to.

Through the design of the Ghost Annotator we answer the following questions:

\[RQ1\] Is there a relationship between models’ uncertainty and HLV expressed in disaggregated corpora?

\[RQ2\] Do models align with specific categories of annotators?

Our results indicate that larger LLMs exhibit higher confidence in their predictions while diverging more substantially from human annotations than smaller models. Despite these differences, all models display confidence patterns that reflect collective annotator behavior: as disagreement among annotators increases for a given message, model uncertainty correspondingly increases, in line with findings from previous work Schmeisser-Nieto et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib55)); Anand et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib52)). Finally, the Ghost Annotator framework reveals a consistent and robust pattern of demographic misalignment with specific socio-demographic groups, suggesting a structural bias in pretraining corpora that is shared by models of different sizes and families.

Footnote 1: The code for our experiments is available at: [https://anonymous.4open.science/r/ghost-annotator-825C/README.md](https://anonymous.4open.science/r/ghost-annotator-825C/README.md)

## 2 Related Work

Annotators’ individual characteristics affect how a text is perceived. Mieleszczenko-Kowszewicz et al. ([2023](https://arxiv.org/html/2606.02911#bib.bib22)) examined how the psychological and emotional traits of 40 annotators across different tasks and texts determine the perception of text, also over time. Human instability and diversity make annotations, in general, hard to reproduce. However, to lower annotation time and costs, pre-trained models are increasingly used for creating datasets, simulating human activities and evaluating models’ outputs Tan et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib54)); Aher et al. ([2023](https://arxiv.org/html/2606.02911#bib.bib32)); Li et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib34)). This raises the need to evaluate their reliability in replacing humans Calderon et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib33)); Gligorić et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib36)), and to guarantee a degree of diversity in their annotations. Besides the common approaches based on active learning to optimize the annotation budget Wang and Plank ([2023](https://arxiv.org/html/2606.02911#bib.bib56)), some techniques that account for HLV through selection criteria for examples and annotators have been proposed Baumler et al. ([2023](https://arxiv.org/html/2606.02911#bib.bib57)); van der Meer et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib58)). However, Gruber et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib35)) argue that these techniques do not consider the distinction between HLV and annotation error and that, in general, LLMs are preferred because they can automatically provide label distributions. In this scenario, however, LLMs-as-annotators tend to perform better on English datasets, are biased toward annotating texts as offensive and abusive, produce label distributions not aligned with human opinion distributions Pavlovic and Poesio ([2024a](https://arxiv.org/html/2606.02911#bib.bib59)), and, even if prompted with diverse personas, struggle to generate responses as diverse as those of humans Sarumi et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib78)); Lan et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib19)).

Footnote 2: LLM-as-a-judge is also used in available evaluation frameworks that score the bias of LLMs: [https://deepeval.com/](https://deepeval.com/).

Among scholars who studied the correlation between model predictions and distinct human responses, Schmeisser-Nieto et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib55)) and Anand et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib52)) demonstrated that models exhibit low confidence when annotators disagree more with each other. Disagreement can be caused by different factors Sandri et al. ([2023](https://arxiv.org/html/2606.02911#bib.bib60)); Wan et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib28)); Frenda et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib21)), and especially in tasks like hate speech detection, beliefs, identities and demographics are correlated with the level of toxicity and offensive language perceived in a message Sap et al. ([2022a](https://arxiv.org/html/2606.02911#bib.bib61)); Mostafazadeh Davani et al. ([2024a](https://arxiv.org/html/2606.02911#bib.bib62)). If HLV is not captured by datasets and models, the result is unfair model behavior (e.g., discrimination against minorities, reinforcement of stereotypes, or eclipsing of segments of the population). To investigate the presence of biases in pre-trained models, various scholars explored the use of questionnaires, evaluation frameworks, and word association tests with the purpose of unveiling their political or value preferences and moral attitudes Wright et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib63)); Jiang et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib29)); Rao et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib64)); Abramski et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib20)); Dai et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib30)). All these studies reveal that, unfortunately, LLMs are not suitable for a global audience.

Inspired by the work of Urbinati et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib31)), we use conformal prediction to estimate the uncertainty of models towards human annotations. The novelty of our work is a new framework that examines models’ correlation with HLV and helps to position their representation, in terms of the Ghost Annotator, across diverse sociodemographic axes. Recently introduced in NLP (Chen et al., [2023](https://arxiv.org/html/2606.02911#bib.bib46)), conformal prediction has been exploited in previous studies to trigger moderators’ review in automatic hateful content moderation Villate-Castillo et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib23)), estimate models’ uncertainty in text generation (Wang et al., [2025](https://arxiv.org/html/2606.02911#bib.bib71)), machine translation (Zerva and Martins, [2024](https://arxiv.org/html/2606.02911#bib.bib49)), and text classification (Sheng et al., [2025](https://arxiv.org/html/2606.02911#bib.bib73)), and clean mislabeled data based on a small curated calibration set Zhan et al. ([2023](https://arxiv.org/html/2606.02911#bib.bib25)). With our work, we provide a fair framework, based on a statistically guaranteed technique Campos et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib27)), to evaluate and conscientiously use pre-trained models in the creation and augmentation of training datasets, ensuring diverse annotations.

## 3 Experimental Setting

In this section we present the experimental setting that drives our research. In Section [3.1](https://arxiv.org/html/2606.02911#S3.SS1) we present Conformal Prediction, which is used to estimate models’ uncertainty against human annotations. In Section [3.2](https://arxiv.org/html/2606.02911#S3.SS2) we describe the Ghost Prediction, an alternative to accuracy-based metrics that is used to quantify models’ divergence from disaggregated human annotators. In Section [3.3](https://arxiv.org/html/2606.02911#S3.SS3) we describe the Ghost Annotator, a framework to profile models and human annotators inspired by Collaborative Filtering and built upon Conformal Prediction and Ghost Predictions. Sections [3.4](https://arxiv.org/html/2606.02911#S3.SS4) and [3.5](https://arxiv.org/html/2606.02911#S3.SS5) respectively present the datasets and models that we adopted in our experiment.

### 3.1 Conformal Prediction

Conformal Prediction (Angelopoulos et al., [2023](https://arxiv.org/html/2606.02911#bib.bib43); Fontana et al., [2023](https://arxiv.org/html/2606.02911#bib.bib44)) is a framework for producing calibrated uncertainty estimates by associating predictions with non-conformity scores derived from a held-out calibration set. From this calibration procedure, we derive a Non-Conformity Score (NCS) (Eq. [4](https://arxiv.org/html/2606.02911#A1.E4) in Appendix [A](https://arxiv.org/html/2606.02911#A1)), which quantifies how unusual a prediction is with respect to the calibration distribution. The core idea behind Conformal Prediction is that it is possible to calibrate a model by computing its average NCS on a limited set of data (the calibration set) and then use this score to assess the uncertainty of the model’s predictions on unseen data. To ensure comparability across datasets and models, NCS values are normalized within each dataset using their empirical calibration distributions. We compute Non-Conformity Scores at the level of individual model–annotator–instance interactions (Eq. [3](https://arxiv.org/html/2606.02911#A1.E3) in Appendix [A](https://arxiv.org/html/2606.02911#A1)). This yields a set of NCS values for each annotated example, rather than a single aggregated score. The resulting collection of scores forms an empirical distribution that we use to characterize both annotators and models.
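As a rough illustration of the per-interaction scoring described above, the sketch below assumes a score that is common in the conformal prediction literature: one minus the probability the model assigns to the annotator's label. The paper's own definitions are in Eq. 3 and Eq. 4 of Appendix A and may differ. The normalization step is also an assumption: it maps each score to its empirical rank among the calibration scores. The function names `ncs` and `normalize` and all values are hypothetical.

```python
from bisect import bisect_right

def ncs(probs, human_label):
    """Assumed score: 1 - probability the model gives to the annotator's label."""
    return 1.0 - probs[human_label]

def normalize(score, calibration_scores):
    """Assumed normalization: fraction of calibration scores <= score (empirical CDF)."""
    ordered = sorted(calibration_scores)
    return bisect_right(ordered, score) / len(ordered)

# One item on a 1-5 scale with three annotators:
# one NCS per model-annotator-instance interaction, not a single aggregate.
model_probs = {1: 0.05, 2: 0.10, 3: 0.50, 4: 0.25, 5: 0.10}
annotations = [3, 4, 5]
scores = [ncs(model_probs, label) for label in annotations]

calibration = [0.2, 0.4, 0.6, 0.8, 0.95]  # toy calibration scores
normalized = [normalize(s, calibration) for s in scores]
```

Keeping one score per interaction, rather than averaging per item, is what later allows the scores to be regrouped by annotator or by annotator group.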

In this work, we use Conformal Prediction to derive uncertainty scores that serve as the basis for constructing model–annotator interaction representations in order to identify patterns of statistical divergence between model predictions and human annotations\. Specifically, we use the NCS as a measure of divergence between model predictions and human annotations, acknowledging that it reflects statistical misalignment rather than causal bias\. This approach is extremely flexible because it can be adopted to capture individual preferences or group dynamics by partially aggregating annotators\.

### 3.2 Ghost Prediction

Model evaluation in classification tasks commonly relies on accuracy-based performance, comparing model predictions with the ground truth obtained by aggregating human labels or their distribution Leonardelli et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib74)). Recently, some evaluation methods that take HLV into account have been proposed. These compare the model’s predictions with annotators’ labels grouped by similar profiles Akhtar et al. ([2021](https://arxiv.org/html/2606.02911#bib.bib38)); Gordon et al. ([2022](https://arxiv.org/html/2606.02911#bib.bib76)), and with individual annotators’ labels Mostafazadeh Davani et al. ([2022](https://arxiv.org/html/2606.02911#bib.bib65)); Mokhberian et al. ([2024](https://arxiv.org/html/2606.02911#bib.bib66)); Orlikowski et al. ([2025](https://arxiv.org/html/2606.02911#bib.bib75)); Lo et al. ([2025b](https://arxiv.org/html/2606.02911#bib.bib77)). Moreover, all these works mainly rely on the computation of accuracy-based metrics (e.g., F1 score, MAE). Inspired by works on human bias investigation Kocoń et al. ([2021b](https://arxiv.org/html/2606.02911#bib.bib80)); Mieleszczenko-Kowszewicz et al. ([2023](https://arxiv.org/html/2606.02911#bib.bib22)) and differently from previous works on LLM bias measurement (see Section [2](https://arxiv.org/html/2606.02911#S2)), we introduce the Ghost Prediction metric. Moving beyond performance-based evaluation of model outputs, we define Ghost Prediction as the proportion of instances where the model predicts a label that is not present among any human annotations for the same item.

$$GP=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left(y_{m}^{(i)}\notin Y_{h}^{(i)}\right) \tag{1}$$

where $y_{m}^{(i)}$ denotes the label predicted by the model for instance $i$, and $Y_{h}^{(i)}$ represents the set of labels provided by human annotators for the same instance.
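Eq. 1 translates directly into a few lines of Python. The sketch below is illustrative only: the function name `ghost_prediction_rate` and the toy labels are hypothetical, and each item is assumed to come with the set of labels chosen by its human annotators.

```python
def ghost_prediction_rate(model_labels, human_label_sets):
    """Share of items whose predicted label was chosen by no human annotator (Eq. 1)."""
    if len(model_labels) != len(human_label_sets):
        raise ValueError("one prediction and one label set per item are required")
    if not model_labels:
        raise ValueError("at least one item is required")
    ghosts = sum(
        1 for y_m, y_h in zip(model_labels, human_label_sets) if y_m not in y_h
    )
    return ghosts / len(model_labels)

predictions = [3, 1, 5, 2]                       # model label per item
annotations = [{2, 3}, {1}, {1, 2, 3}, {3, 4}]   # human label set per item
gp = ghost_prediction_rate(predictions, annotations)  # items 3 and 4 are ghosts
```

Because the check is set membership, the metric does not depend on how many annotators chose a label, only on whether anyone did.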

### 3.3 Annotators Representation through Conformal Prediction-based Embeddings

To capture individual behavioral patterns across instances, we design annotator representations taking inspiration from the Collaborative Filtering (CF) framework. CF is probably the most popular technique in the area of recommender systems (Ricci et al., [2022](https://arxiv.org/html/2606.02911#bib.bib9); Schafer et al., [2007](https://arxiv.org/html/2606.02911#bib.bib11)), i.e., software tools which generate personalized suggestions (recommendations) promoting items that are most likely to match the needs, preferences or interests of a certain user (Burke et al., [2011](https://arxiv.org/html/2606.02911#bib.bib10)) (the target), with the aim of mitigating the so-called information overload problem (Maes, [1994](https://arxiv.org/html/2606.02911#bib.bib12)). The original version of CF, also known as user-based CF, draws on the idea that users who agreed on their evaluations for some items in the past are likely to agree on others too: hence, this approach generates recommendations based on items liked by other users with similar tastes, namely, with a similar rating history (Goldberg et al., [1992](https://arxiv.org/html/2606.02911#bib.bib13)).

The CF analogy maps naturally onto the content moderation setting: platforms correspond to recommender systems, moderators or annotators to users, content items to items, and severity judgments to ratings\. Under this mapping, the Ghost Annotator can be interpreted as identifying which human moderators a model least resembles\. In a deployed pipeline this information could guide the routing of uncertain or divergent cases toward the reviewers whose judgment the model is least likely to replicate\.

The representation design inherits three practically important properties from this inspiration\. First, it handles heterogeneity in annotation volume naturally: unlike approaches requiring each annotator to label the same items, it operates effectively when annotators have covered different subsets of the data, which is the norm in large\-scale crowdsourced corpora\. Second, it is compatible with disaggregated annotations, operating directly on per\-annotator and per\-instance scores rather than collapsed majority labels, preserving the individual variation that HLV research aims to capture\. Third, because annotator representations are built from NCS distributions rather than raw label sequences, they are not tied to the specific items of a single corpus, enabling meaningful comparisons of annotator and model behavior across different datasets, as demonstrated in the cross\-corpus analysis in Section[4\.2](https://arxiv.org/html/2606.02911#S4.SS2)\.

In this work we construct annotator representations based on the distribution of Non\-Conformity Scores \(NCS\) produced by model–annotator interactions\. Our methodology relies on the following assumptions:

1. we model the interaction between an LLM and a human annotation as a scoring process that yields an NCS;
2. each interaction produces an NCS, interpreted as a measure of mismatch between model prediction and human annotation;
3. each annotator is represented by the distribution of NCS values obtained from all items they annotated;
4. the model is represented by the distribution of NCS values computed on instances where its prediction differs from all human annotations (Ghost Predictions) (Section [3.2](https://arxiv.org/html/2606.02911#S3.SS2)).

In order to adequately compare annotators that labeled different amounts of messages, we represent them as a 3\-dimensional vector derived from the quartiles of the NCS distributions\.

The model is represented as a 3-dimensional vector, as well. Since the model’s representation is based on the NCSs of its ghost predictions, we define the Ghost Annotator $\vec{G_{m}}$ as a vector representation of the model in the annotator space. It is constructed by aggregating the quartiles of the Non-Conformity Score (NCS) distributions derived from Ghost Predictions (Eq. [5](https://arxiv.org/html/2606.02911#A1.E5) in Appendix [A](https://arxiv.org/html/2606.02911#A1)). This representation does not correspond to a real or synthetic human annotator, but to a geometric embedding of model behavior.

We compute cosine similarity between the model embedding $\vec{G_{m}}$ and annotator embeddings to measure relative alignment in the annotator space.
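The representation and comparison steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the three dimensions are the first quartile, median and third quartile of the NCS values, computed with the "inclusive" method of Python's `statistics.quantiles`; the exact aggregation is given in Eq. 5 of Appendix A. The function names and NCS values are hypothetical.

```python
import math
from statistics import quantiles

def quartile_vector(ncs_values):
    """3-dimensional representation (Q1, median, Q3); needs at least two values."""
    q1, q2, q3 = quantiles(ncs_values, n=4, method="inclusive")
    return (q1, q2, q3)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

annotator_ncs = [0.1, 0.2, 0.3, 0.4, 0.5]  # NCS over all items one annotator labeled
ghost_ncs = [0.2, 0.4, 0.6, 0.8, 1.0]      # NCS over the model's Ghost Predictions only

annotator_vec = quartile_vector(annotator_ncs)
ghost_annotator = quartile_vector(ghost_ncs)
similarity = cosine_similarity(ghost_annotator, annotator_vec)
```

In this toy input the ghost scores are exactly twice the annotator scores, so the two vectors are parallel and the similarity is 1 up to floating-point error. Summarizing each distribution by its quartiles gives vectors of the same length regardless of how many items an annotator labeled.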

We adopt this methodology to systematically explore the alignment of models with specific categories of annotators in perceiving relevant phenomena for content moderation \(e\.g\., Hate Speech, offensiveness\)\.

### 3.4 Datasets

| Dataset | Avg. Ann. | Rel. Maj. | Abs. Maj. | Qual. Maj. | Unan. | Label Fitting (avg) | Isolation (avg) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Attitudes (Hate Speech) (Sap et al., [2022b](https://arxiv.org/html/2606.02911#bib.bib40)) | 5.523 | 0.632 | 0.038 | 0.311 | 0.019 | 0.578 | 0.350 |
| CADE (Acceptability) (Lo et al., [2025a](https://arxiv.org/html/2606.02911#bib.bib39)) | 5.700 | 0.343 | 0.206 | 0.337 | 0.114 | 0.505 | 0.325 |
| Disentangling (Offensiveness) (Mostafazadeh Davani et al., [2024b](https://arxiv.org/html/2606.02911#bib.bib37)) | 32.324 | 0.599 | 0.217 | 0.179 | 0.005 | 0.916 | 0.472 |
| MHS (Violence) (Sachdeva et al., [2022](https://arxiv.org/html/2606.02911#bib.bib53)) | 5.856 | 0.354 | 0.003 | 0.280 | 0.363 | 0.351 | 0.261 |

Table 1: Description of datasets according to the following axes. Majorities: percentage of relative majority ($x \le 0.50$), absolute majority ($0.50 < x < 0.66$), qualified majority ($0.66 < x < 1$). Label fitting: all the labels chosen by at least one annotator / all the possible labels; Isolation: % of times in which an annotator diverges from majority.

We chose four datasets annotated for topics related to content moderation, in order to assess the generalization of our method across different phenomena. We followed two guiding principles for data selection to ensure comparability between corpora: i. we only selected datasets with a scalar annotation scheme to ensure a coherent scheme across them; ii. we did not include datasets provided by the same research group to avoid research bias (Hovy and Prabhumoye, [2021](https://arxiv.org/html/2606.02911#bib.bib41)) and maximize their reciprocal independence. The benchmark includes the following datasets:

Attitudes (Sap et al., [2022b](https://arxiv.org/html/2606.02911#bib.bib40)): a corpus of 627 tweets annotated for Hate Speech (HS) detection on a scale from 1 to 5.

CADE (Lo et al., [2025a](https://arxiv.org/html/2606.02911#bib.bib39)): a corpus of 2,094 YouTube comments ranked on the basis of their unacceptability on a scale from 1 to 4.

MHS (Sachdeva et al., [2022](https://arxiv.org/html/2606.02911#bib.bib53)): a corpus of 39,461 tweets annotated according to a multidimensional annotation scheme on a scale from 0 to 4. For this study we focused on the axis of violence.

Three types of descriptive statistics have been extracted from each dataset to identify different and common features between them\.

#### Distribution of majority types\.

Inspired by the work of Leonardelli et al. ([2021](https://arxiv.org/html/2606.02911#bib.bib42)), we described each message according to the type of majority formed by annotators: unanimity ($x = 1$), qualified majority ($0.66 < x < 1$), absolute majority ($0.5 < x < 0.66$), and relative majority ($x \le 0.5$). As can be observed in Table [1](https://arxiv.org/html/2606.02911#S3.T1), the distribution of majority types differs significantly between datasets, suggesting divergent annotation behaviors across datasets.
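The four majority types can be assigned per message with a short function. The sketch below is hypothetical: it takes $x$ to be the share of annotators who chose the most frequent label, and it assigns the boundary value $x = 0.66$, which the thresholds above leave open, to the qualified majority.

```python
from collections import Counter

def majority_type(labels):
    """Classify one item by the share x of annotators choosing its most frequent label."""
    share = Counter(labels).most_common(1)[0][1] / len(labels)
    if share == 1:
        return "unanimity"
    if share >= 0.66:  # x = 0.66 is unassigned in the text; assumed qualified here
        return "qualified"
    if share > 0.5:
        return "absolute"
    return "relative"

examples = {
    "unanimity": [1, 1, 1],        # x = 1.00
    "qualified": [1, 1, 1, 2],     # x = 0.75
    "absolute": [1, 1, 1, 2, 3],   # x = 0.60
    "relative": [1, 1, 2, 2],      # x = 0.50
}
results = {name: majority_type(labels) for name, labels in examples.items()}
```

Averaging the indicator of each type over all messages of a corpus would give per-dataset shares of the kind reported in Table 1.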

#### Label fitting and average number of annotators\.

This statistic describes the average percentage of scalar values that have been selected by at least one annotator. Excluding Disentangling, whose average of 32.2 annotators per message causes a very high label fitting, differences also emerge between datasets with a comparable average number of annotators.
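Under the reading given in the caption of Table 1 (labels chosen by at least one annotator divided by all possible labels), label fitting can be sketched as below. The function names and toy annotations are hypothetical, and the statistic is assumed to be averaged over messages.

```python
def label_fitting(labels, scale):
    """Share of the scale's values chosen by at least one annotator for one item."""
    scale_values = set(scale)
    return len(set(labels) & scale_values) / len(scale_values)

def average_label_fitting(items, scale):
    """Mean label fitting over all items of a corpus."""
    return sum(label_fitting(labels, scale) for labels in items) / len(items)

# Three toy messages annotated on a 1-5 scale.
items = [[1, 1, 2], [3, 4, 5, 5], [2, 2, 2]]
avg = average_label_fitting(items, scale=range(1, 6))
```

Since each additional annotator can only add labels to the set, a corpus with many annotators per message tends toward higher label fitting.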

#### Annotators Isolation\.

This statistic reports the average percentage of annotators who label a message in contrast with the majority. Consistent with its high number of annotations per message, Disentangling has the highest annotator isolation, but differences also arise between the other corpora.
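A possible computation of annotator isolation is sketched below. It assumes that "the majority" is the most frequent label of a message and that ties are broken by the first label encountered; neither choice is specified in the text. The function name and data are hypothetical.

```python
from collections import Counter

def isolation(labels):
    """Share of annotators whose label differs from the item's most frequent label."""
    # Ties are broken by first occurrence (an assumption, not stated in the text).
    majority_label, _ = Counter(labels).most_common(1)[0]
    return sum(1 for label in labels if label != majority_label) / len(labels)

items = [[1, 1, 2], [3, 3, 3, 3], [2, 4, 4, 5]]
per_item = [isolation(labels) for labels in items]  # 1/3, 0, 1/2
avg_isolation = sum(per_item) / len(per_item)
```

Under this reading, a unanimous message contributes zero isolation, and a message with a bare relative majority contributes the most.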

### 3.5 Models

In this experiment, a selection of pre-trained language models was employed to tackle a text annotation task, aimed at classifying social media posts based on the presence of harmful content (the prompts and the experimental setup are reported in Appendix [B](https://arxiv.org/html/2606.02911#A2) and [C](https://arxiv.org/html/2606.02911#A3)). The focus is on identifying different categories of harmful content such as violence, hate speech, acceptability, and offensiveness.

The chosen models were tasked with generating probabilities for specific labels and calculating the NCS\. Two families of LLMs, Qwen and Llama, were selected for benchmarking performance across different model scales\. Therefore, we employ two smaller models \(i\.e\., Qwen/Qwen2\.5\-1\.5B\-Instruct, Meta\-llama/Llama\-3\.2\-1B\-Instruct\), one from the Qwen family and one from the Llama family, along with their respective medium\-sized counterparts \(i\.e\., Qwen/Qwen2\.5\-7B\-Instruct, Meta\-llama/Llama\-3\.1\-8B\-Instruct\)\. In particular, these models were selected for their ability to understand instructions and generate responses tailored to classification tasks\.

## 4 Results

In this section we present the results of our experiments. Section [4.1](https://arxiv.org/html/2606.02911#S4.SS1) presents results about the comparison between models’ uncertainty and HLV (RQ1); Section [4.2](https://arxiv.org/html/2606.02911#S4.SS2) describes the impact of datasets and models in the alignment of LLMs with specific groups of annotators (RQ2).

### 4.1 \[RQ1\] Is there a coherence between HLV and models’ uncertainty?

Our first experiment is aimed at exploring the coherence between LLMs’ uncertainty and collective behaviors in the context of dataset annotation. We jointly study the average NCS of LLMs across datasets (Section [3.1](https://arxiv.org/html/2606.02911#S3.SS1)) and their tendency to output ghost predictions (Section [3.2](https://arxiv.org/html/2606.02911#S3.SS2)). We examine whether a pattern links the different majority types emerging among human annotators to models’ uncertainty.

#### Scale\-dependent confidence patterns in ghost predictions

The average conformity score of models across datasets on all predictions (Figure [1](https://arxiv.org/html/2606.02911#S4.F1), Top) shows that smaller models are more confident about their predictions, regardless of their LLM family and the type of dataset. Excluding Qwen2.5-1.5B on the CADE dataset, the conformity scores of smaller LLMs exhibit lower variation than those of their larger counterparts. When only the Ghost Predictions are considered, namely predictions that do not correspond to any human annotation (Section [3.2](https://arxiv.org/html/2606.02911#S3.SS2)), larger models exhibit higher confidence, as can be observed in Figure [1](https://arxiv.org/html/2606.02911#S4.F1) (Bottom). These results seem counter-intuitive, as larger models are expected to be more aligned with human preferences and values. One possible explanation is that larger pretraining corpora encode stronger majority-culture norms, producing more confident internal representations precisely in regions where those norms diverge from minoritized perspectives. This interpretation is consistent with the demographic misalignment reported in Section [4.2](https://arxiv.org/html/2606.02911#S4.SS2), where all models systematically diverge from SSA annotators regardless of size, suggesting that both findings reflect the same upstream bias rather than model-specific artifacts.

![Refer to caption](https://arxiv.org/html/2606.02911v1/x1.png)
![Refer to caption](https://arxiv.org/html/2606.02911v1/figures/Figure_2_bottom.png)

Figure 1: (Top) Box plot showing the uncertainty of models on all predicted labels across datasets. (Bottom) Box plot showing the uncertainty of models on Ghost Predictions, namely all predicted labels that were not chosen by any human annotators.

#### Model uncertainty is not fully coherent with annotation isolation.

We observe an inverse relationship between high model uncertainty and the proportion of Ghost Annotators, suggesting that model uncertainty does not fully mirror human uncertainty\.

The relationship between model uncertainty and human disagreement becomes stronger in datasets where ghost predictions are less frequent\. In these settings, ghost predictions appear to emerge in more structurally ambiguous regions of the annotation space, where disagreement among annotators is also higher\. Conversely, datasets characterized by a larger number of ghost predictions show a weaker association between Ghost NCS and annotator isolation\.

Figure [2](https://arxiv.org/html/2606.02911#S4.F2) \(Top\) shows that Disentangling is the dataset with the lowest proportion of Ghost Predictions across all LLMs, while smaller models generally exhibit higher Ghost Prediction rates than larger ones, with Attitudes representing the main exception to this trend\. Figure [2](https://arxiv.org/html/2606.02911#S4.F2) \(Bottom\) further shows that Disentangling also exhibits the strongest negative correlation between Ghost NCS and annotator isolation\. A similar pattern is observed for CADE, whereas weaker correlations emerge in Attitudes, which is characterized by higher Ghost Prediction frequencies\. The main exception is MHS, where the correlation remains consistently weak despite relatively high Ghost Prediction rates\.

Overall, these findings suggest that Ghost NCS captures not only prediction uncertainty, but also the extent to which model divergences align with regions of human disagreement\.

![Refer to caption](https://arxiv.org/html/2606.02911v1/x2.png)
![Refer to caption](https://arxiv.org/html/2606.02911v1/figures/Figure_1_bottom.png)

Figure 2: \(Top\) The correlation between the model’s Ghost average NCS and the fraction of human agreement on each comment, computed for each model–dataset pair\. \(Bottom\) The corresponding correlation with annotator isolation, defined as the complement of agreement\. In both cases, only comments where the model prediction does not appear in the set of human annotations are considered\. The heatmaps report Pearson correlation coefficients, along with statistical significance \(p\-values\), across all model–dataset combinations\.

| Dataset | Model | 18\-30 M Arab | 18\-30 M LatAm | 18\-30 M SSA | 18\-30 W Indian | 18\-30 W LatAm | 18\-30 W NA | 18\-30 W SSA | 18\-30 W WE | 30\-50 M Indian | 30\-50 W Oceania |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Davani | Llama\-3\.1\-8B | 0\.2208 | *0\.2044* | 0\.2722 | 0\.2273 | 0\.2531 | 0\.2360 | 0\.2282 | 0\.2455 | 0\.2683 | **0\.3534** |
| Davani | Llama\-3\.2\-1B | 0\.2338 | 0\.2555 | *0\.1302* | 0\.3030 | 0\.2222 | 0\.2795 | 0\.2148 | 0\.2395 | **0\.3415** | 0\.3308 |
| Davani | Qwen\-1\.5B | **0\.2922** | 0\.2263 | 0\.2426 | *0\.2197* | 0\.2222 | 0\.2236 | 0\.2483 | 0\.2814 | 0\.2602 | 0\.2857 |
| Davani | Qwen\-7B | 0\.2532 | 0\.2555 | *0\.1775* | **0\.3182** | 0\.2963 | 0\.2360 | 0\.1946 | 0\.2575 | 0\.2358 | 0\.2932 |
| Measuring | Llama\-3\.1\-8B | 0\.2484 | 0\.2500 | *0\.2145* | 0\.2708 | 0\.2562 | 0\.2531 | 0\.2383 | 0\.2515 | 0\.2419 | **0\.2838** |
| Measuring | Llama\-3\.2\-1B | 0\.2451 | 0\.2445 | *0\.2175* | 0\.2765 | 0\.2562 | 0\.2516 | 0\.2366 | 0\.2530 | 0\.2398 | **0\.2876** |
| Measuring | Qwen\-1\.5B | 0\.2451 | 0\.2445 | *0\.2175* | 0\.2765 | 0\.2562 | 0\.2516 | 0\.2366 | 0\.2530 | 0\.2398 | **0\.2876** |
| Measuring | Qwen\-7B | 0\.2451 | 0\.2500 | *0\.2160* | 0\.2765 | 0\.2531 | 0\.2484 | 0\.2399 | 0\.2530 | 0\.2398 | **0\.2876** |
| CADE | Llama\-3\.1\-8B | 0\.2451 | 0\.2464 | *0\.2145* | 0\.2784 | 0\.2562 | 0\.2516 | 0\.2349 | 0\.2530 | 0\.2398 | **0\.2895** |
| CADE | Llama\-3\.2\-1B | 0\.2435 | 0\.2518 | *0\.2130* | 0\.2746 | 0\.2562 | 0\.2500 | 0\.2383 | 0\.2560 | 0\.2419 | **0\.2838** |
| CADE | Qwen\-1\.5B | 0\.2468 | 0\.2445 | *0\.2160* | 0\.2822 | 0\.2546 | 0\.2531 | 0\.2399 | 0\.2485 | 0\.2378 | **0\.2857** |
| CADE | Qwen\-7B | 0\.2451 | 0\.2500 | *0\.2145* | 0\.2765 | 0\.2546 | 0\.2516 | 0\.2349 | 0\.2530 | 0\.2398 | **0\.2895** |
| Attitudes | Llama\-3\.1\-8B | 0\.2468 | 0\.2445 | *0\.2175* | 0\.2765 | 0\.2546 | 0\.2500 | 0\.2332 | 0\.2545 | 0\.2439 | **0\.2876** |
| Attitudes | Llama\-3\.2\-1B | 0\.2451 | 0\.2445 | *0\.2160* | 0\.2765 | 0\.2562 | 0\.2516 | 0\.2366 | 0\.2530 | 0\.2398 | **0\.2895** |
| Attitudes | Qwen\-1\.5B | 0\.2451 | 0\.2445 | *0\.2160* | 0\.2746 | 0\.2546 | 0\.2516 | 0\.2383 | 0\.2530 | 0\.2398 | **0\.2914** |
| Attitudes | Qwen\-7B | 0\.2451 | 0\.2464 | *0\.2160* | 0\.2727 | 0\.2546 | 0\.2516 | 0\.2416 | 0\.2515 | 0\.2398 | **0\.2895** |

Table 2: Delta values across datasets, models, and demographic groups\. Row\-wise maxima \(bold\) and minima \(italic\) are highlighted\.

### 4\.2 \[RQ2\] Do models align with specific categories of annotators?

Our second experiment adopts the Ghost Annotator \(Section[3\.3](https://arxiv.org/html/2606.02911#S3.SS3)\) to identify whether LLMs align with the perspectives of annotators characterized by specific socio\-demographic traits\. Since the only dataset including a balanced representation of socio\-demographic traits across different axes is Disentangling, our analysis focused only on it\.

Our experiment involves the following steps for each model:

1. we generated a representation of the model by computing the Ghost Annotator \(Section [3\.3](https://arxiv.org/html/2606.02911#S3.SS3)\);
2. we selected the 10 largest socio\-demographic groups based on the intersection of three features: gender, age, and macro\-region of origin \(the list of intersectional groups is reported in Appendix [3](https://arxiv.org/html/2606.02911#A4.T3)\);
3. we computed the cosine similarity between the vector representation of the Ghost Annotator and the representation of each annotator in Disentangling, and grouped the whole distribution of annotators into quartiles;
4. for each of the 10 socio\-demographic groups, we computed the ratio between the number of annotators in the fourth quartile \(nearest to the model\) and the total number of annotators belonging to that group, in order to assess whether some groups are nearer to the Ghost Annotator;
5. we created a representation of the Ghost Annotator based on Measuring, Attitudes, and CADE and repeated the whole procedure to identify potential similarity patterns across corpora\.
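
Steps 3 and 4 above can be sketched in plain Python as follows\. This is a minimal sketch under stated assumptions: annotators are given as dense vectors of equal length, the names `cosine` and `top_quartile_ratio_by_group` are hypothetical, and the quartile boundary uses the default method of `statistics.quantiles`\. How missing annotations are handled in the actual representation is not covered here\.

```python
import math
from statistics import quantiles


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_quartile_ratio_by_group(ghost_vector, annotators):
    """Share of each group's annotators that fall in the fourth quartile.

    annotators: list of (group_name, vector) pairs (hypothetical layout).
    """
    sims = [(group, cosine(ghost_vector, vec)) for group, vec in annotators]
    # The third cut point is the lower bound of the fourth (nearest) quartile.
    q3 = quantiles([s for _, s in sims], n=4)[2]
    totals, nearest = {}, {}
    for group, sim in sims:
        totals[group] = totals.get(group, 0) + 1
        if sim > q3:
            nearest[group] = nearest.get(group, 0) + 1
    # Ratio of annotators nearest to the Ghost Annotator over the group size.
    return {group: nearest.get(group, 0) / totals[group] for group in totals}
```

If group membership were unrelated to similarity, each ratio would be expected to sit close to 0\.25, so values well above or below that level indicate groups that are nearer to or farther from the Ghost Annotator\.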

This approach allows us to identify patterns that characterize the interaction between models, human annotators, and datasets: we assess not only the possible alignment of a model with specific socio\-demographic groups but also its generalizability outside the context of a specific corpus\.

We verified the robustness of our approach through two tests: \(i\) we generated a random distribution of synthetic similarity scores, bounded by the minimum and maximum values of the real distributions; \(ii\) we kept the original similarity scores between humans and the Ghost Annotator and randomized them\. We repeated the procedure 100 times and computed the Wilcoxon\-Mann\-Whitney test\. The procedure shows a statistically significant difference between the real annotations and the two setups, indicating that the similarity scores of annotators are not due to chance\.
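
One way to read the two robustness tests is sketched below in plain Python\. Several choices are assumptions rather than details given in the text: synthetic scores are drawn uniformly between the observed minimum and maximum, randomization is implemented as a shuffle of the scores across annotators, the quantity compared is the set of similarity scores of one demographic group, and the test uses a normal approximation without tie correction\. All function names are hypothetical\.

```python
import math
import random
from statistics import NormalDist


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average rank for tied values
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, src) in zip(ranks, pooled) if src == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def null_scores(real_scores, mode, rng):
    """Test (i): uniform synthetic scores; test (ii): shuffled real scores."""
    if mode == "synthetic":
        lo, hi = min(real_scores), max(real_scores)
        return [rng.uniform(lo, hi) for _ in real_scores]
    shuffled = list(real_scores)
    rng.shuffle(shuffled)
    return shuffled


def robustness_p_values(real_scores, group_mask, mode, runs=100, seed=0):
    """Compare one group's real scores with its scores under each null run."""
    rng = random.Random(seed)
    real_group = [s for s, m in zip(real_scores, group_mask) if m]
    p_values = []
    for _ in range(runs):
        null = null_scores(real_scores, mode, rng)
        null_group = [s for s, m in zip(null, group_mask) if m]
        p_values.append(mann_whitney_p(real_group, null_group))
    return p_values
```

Under this reading, consistently small p\-values across the 100 runs would indicate that a group's proximity to the Ghost Annotator is unlikely to arise by chance\. A library routine such as `scipy.stats.mannwhitneyu` could replace the hand\-written test where SciPy is available\.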

#### Demographic misalignment is robust across models and tasks\.

The results presented in Table [2](https://arxiv.org/html/2606.02911#S4.T2) reveal a striking pattern of consistency\. When the Ghost Annotator is profiled on Davani and the annotators are also from Davani, the similarity with specific socio\-demographic groups shifts across models\. For example, the Ghost Annotator derived from Llama\-3\.2\-1B is most aligned with 30\-50, Male, Indian Cultural Sphere, and the one derived from Qwen\-1\.5B with 18\-30, Male, Arab Culture\. When the Ghost Annotator is profiled on the other datasets \(Measuring, CADE, Attitudes\), it systematically aligns least with annotators from the 18\-30, Male, Sub\-Saharan Africa \(SSA\) demographic group, and most with annotators from the 30\-50, Female, Oceania group\. This pattern holds regardless of whether the source dataset is Measuring, CADE, or Attitudes, with delta values for these two groups differing by less than 0\.01 across Llama\-3\.1\-8B, Llama\-3\.2\-1B, Qwen\-1\.5B, and Qwen\-7B\. The robustness of this finding across architectures from different organizations \(Meta and Alibaba\) and across parameter scales \(1B to 8B\) strongly suggests that the observed misalignment is not a model\-specific artifact but rather reflects a structural property of the annotation tasks or, more broadly, of the pretraining corpora on which these models converge\.

#### Annotator pool size does not predict model alignment\.

A natural confound to address is whether the observed demographic asymmetry is a consequence of differential representation in the annotator pool\. Our data refute this explanation: the SSA group is in fact the *largest* demographic group in the annotator pool \(n=169\), yet it consistently yields the minimum delta value\. This inversion has direct implications for annotation practice\. It suggests that scaling annotator diversity by headcount alone is insufficient to achieve model\-human alignment across demographic groups\. The model’s failure mode, as captured by the Ghost Annotator, reflects a perspective that is systematically distant from SSA annotators’ judgments regardless of how many of them contribute to the dataset\.

#### Bias may be upstream of fine\-tuning\.

The near\-identical delta values produced by architecturally distinct models trained by different organizations point toward a common source of bias that precedes task\-specific fine\-tuning\. We hypothesize that this source lies in the pretraining corpora, which, despite differences in curation and filtering pipelines, likely share a systematic underrepresentation of SSA perspectives on what constitutes offensive or hateful content\. Under this hypothesis, the Ghost Annotator does not capture idiosyncratic model behavior but rather a shared, industry\-wide representation of “normative” annotation that is misaligned with SSA judgments\. This interpretation is consistent with prior work documenting the geographic and linguistic skew of large\-scale web corpora Luccioni and Viviano \([2021](https://arxiv.org/html/2606.02911#bib.bib70)\); Dodge et al\. \([2021](https://arxiv.org/html/2606.02911#bib.bib67)\); Stranisci and Hardmeier \([2026](https://arxiv.org/html/2606.02911#bib.bib69)\), and extends those findings to the level of demographic alignment in content moderation tasks\. Critically, if the bias is rooted in pretraining, dataset\-level interventions such as increasing annotator diversity are unlikely to resolve it without corresponding changes to how models are pretrained or adapted\.

## 5 Conclusion

This work proposes the Ghost Annotator, a framework designed to uncover and assess divergences between LLMs and specific groups of human annotators through uncertainty estimation\. We evaluated four models of different sizes across four datasets reporting disaggregated scalar annotations on diverse dimensions of abusive language: violence, hate speech, acceptability, and offensiveness\. Our findings show that Non\-Conformity Scores increase as annotator disagreement increases, but this relationship is strongest in datasets where Ghost Predictions are less frequent\. This suggests that Ghost NCS captures not only prediction uncertainty but also the degree to which model divergences are structurally located in regions of genuine human disagreement\. We also identified a robust pattern of demographic misalignment with specific socio\-demographic groups that holds regardless of model architecture or parameter scale\. Future work will explore the effects of model calibration over specific groups of annotators to identify potential bias mitigation strategies\.

## Limitations

RQ2 findings rely only on the Davani corpus, which is the only dataset containing a balanced set of annotators based on their demographics\. We are aware of this limitation, and for this reason we consider it important, in future experimental settings, to work on primary data, collecting balanced datasets across finer\-grained sets of identity traits\.

Another limitation of our analysis concerns the model families and sizes we evaluated\. In selecting the models, we took into account their open availability and their potentially extensive use due to their small and medium size \(i\.e\., requiring lower computational power\)\. However, we are aware that services and applications for daily assistant activities are powered mainly by closed models, and in the future we plan to employ the proposed framework to evaluate the imperfections of real\-world applications\.

## Ethical Considerations

Our research focuses on capturing sociodemographic biases in models already used by users worldwide\. We are aware that considering a limited set of societal biases and adopting a binary categorization of gender is risky\. However, the proposed framework can be applied to multiple categories and societal dimensions \(e\.g\., ethnicity, origin, disabilities, educational status and so on\)\. We hope our framework can be used to analyze the safety of models before their release, and that this investigation can encourage attention to societal issues in the creation of AI\.

During the writing of the paper we used AI technologies for grammar and spelling checks\.

## References

- Llm\-generated word association norms\.Frontiers in Artificial Intelligence and Applications386,pp\. 3–12\.Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- G\. V\. Aher, R\. I\. Arriaga, and A\. T\. Kalai \(2023\)Using large language models to simulate multiple humans and replicate human subject studies\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 337–371\.External Links:[Link](https://proceedings.mlr.press/v202/aher23a.html)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- S\. Akhtar, V\. Basile, and V\. Patti \(2021\)Whose opinions matter? perspective\-aware models to identify opinions of hate speech victims in abusive language detection\.arXiv preprint arXiv:2106\.15896\.Cited by:[§3\.2](https://arxiv.org/html/2606.02911#S3.SS2.p1.4)\.
- A\. Anand, N\. Mokhberian, P\. Kumar, A\. Saha, Z\. He, A\. Rao, F\. Morstatter, and K\. Lerman \(2024\)Don‘t blame the data, blame the model: understanding noise and bias when learning from subjective annotations\.Inuncertainlp:2024:1,R\. Vázquez, H\. Celikkanat, D\. Ulmer, J\. Tiedemann, S\. Swayamdipta, W\. Aziz, B\. Plank, J\. Baan, and M\. de Marneffe \(Eds\.\),St Julians, Malta,pp\. 102–113\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2024.uncertainlp-1.11/)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p1.1),[§1](https://arxiv.org/html/2606.02911#S1.p7.1),[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- A\. N\. Angelopoulos, S\. Bates,et al\.\(2023\)Conformal prediction: a gentle introduction\.Foundations and trends® in machine learning16\(4\),pp\. 494–591\.Cited by:[§3\.1](https://arxiv.org/html/2606.02911#S3.SS1.p1.1)\.
- C\. Baumler, A\. Sotnikova, and H\. Daumé III \(2023\)Which examples should be multiply annotated? active learning when annotators may disagree\.Infindings:2023:acl,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 10352–10371\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2023.findings-acl.658/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.658)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- R\. D\. Burke, A\. Felfernig, and M\. H\. Göker \(2011\)Recommender systems: an overview\.AI Mag\.32\(3\),pp\. 13–18\.External Links:[Link](https://doi.org/10.1609/aimag.v32i3.2361),[Document](https://dx.doi.org/10.1609/AIMAG.V32I3.2361)Cited by:[§3\.3](https://arxiv.org/html/2606.02911#S3.SS3.p1.1)\.
- F\. Cabitza, A\. Campagner, and V\. Basile \(2023\)Toward a perspectivist turn in ground truthing for predictive computing\.Proceedings of the AAAI Conference on Artificial Intelligence37\(6\),pp\. 6860–6868\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/25840),[Document](https://dx.doi.org/10.1609/aaai.v37i6.25840)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p1.1)\.
- N\. Calderon, R\. Reichart, and R\. Dror \(2025\)The alternative annotator test for LLM\-as\-a\-judge: how to statistically justify replacing human annotators with LLMs\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 16051–16081\.External Links:[Link](https://aclanthology.org/2025.acl-long.782/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.782),ISBN 979\-8\-89176\-251\-0Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- M\. Campos, A\. Farinhas, C\. Zerva, M\. A\. Figueiredo, and A\. F\. Martins \(2024\)Conformal prediction for natural language processing: a survey\.Transactions of the Association for Computational Linguistics12,pp\. 1497–1516\.Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p3.1)\.
- Z\. Chen, Y\. Xie, and M\. Fishel \(2023\)Conformal prediction for natural language processing: a survey\.Transactions of the Association for Computational Linguistics\.Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p3.1),[§2](https://arxiv.org/html/2606.02911#S2.p3.1)\.
- X\. Dai, L\. Zhou, B\. Wang, and H\. Li \(2025\)From word to world: evaluate and mitigate culture bias in LLMs via word association test\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 24521–24537\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1246/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1246),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- J\. Dodge, M\. Sap, A\. Marasović, W\. Agnew, G\. Ilharco, D\. Groeneveld, M\. Mitchell, and M\. Gardner \(2021\)Documenting large webtext corpora: a case study on the colossal clean crawled corpus\.Inemnlp:2021:main,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 1286–1305\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2021.emnlp-main.98/),[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.98)Cited by:[§4\.2](https://arxiv.org/html/2606.02911#S4.SS2.SSS0.Px3.p1.1)\.
- M\. Fontana, G\. Zeni, and S\. Vantini \(2023\)Conformal prediction: a unified review of theory and new challenges\.Bernoulli29\(1\),pp\. 1–23\.Cited by:[§3\.1](https://arxiv.org/html/2606.02911#S3.SS1.p1.1)\.
- P\. Fortuna and S\. Nunes \(2018\)A survey on automatic detection of hate speech in text\.ACM Comput\. Surv\.51\(4\)\.External Links:ISSN 0360\-0300,[Link](https://doi.org/10.1145/3232676),[Document](https://dx.doi.org/10.1145/3232676)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p2.1)\.
- S\. Frenda, G\. Abercrombie, V\. Basile, A\. Pedrani, R\. Panizzon, A\. T\. Cignarella, C\. Marco, and D\. Bernardi \(2025\)Perspectivist approaches to natural language processing: a survey\.Language Resources and Evaluation59\(2\),pp\. 1719–1746\.Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p1.1),[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- K\. Gligorić, T\. Zrnic, C\. Lee, E\. Candes, and D\. Jurafsky \(2025\)Can unconfident llm annotations be used for confident conclusions?\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 3514–3533\.Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- D\. Goldberg, D\. Nichols, B\. M\. Oki, and D\. Terry \(1992\)Using collaborative filtering to weave an information tapestry\.Commun\. ACM35\(12\),pp\. 61–70\.External Links:ISSN 0001\-0782,[Link](https://doi.org/10.1145/138859.138867),[Document](https://dx.doi.org/10.1145/138859.138867)Cited by:[§3\.3](https://arxiv.org/html/2606.02911#S3.SS3.p1.1)\.
- M\. L\. Gordon, M\. S\. Lam, J\. S\. Park, K\. Patel, J\. Hancock, T\. Hashimoto, and M\. S\. Bernstein \(2022\)Jury learning: integrating dissenting voices into machine learning models\.InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems,pp\. 1–19\.Cited by:[§3\.2](https://arxiv.org/html/2606.02911#S3.SS2.p1.4)\.
- C\. Gruber, H\. Alber, B\. Bischl, G\. Kauermann, B\. Plank, and M\. Aßenmacher \(2025\)Revisiting active learning under \(human\) label variation\.InProceedings of the The 4th Workshop on Perspectivist Approaches to NLP,G\. Abercrombie, V\. Basile, S\. Frenda, S\. Tonelli, and S\. Dudy \(Eds\.\),Suzhou, China,pp\. 75–86\.External Links:[Link](https://aclanthology.org/2025.nlperspectives-1.7/),[Document](https://dx.doi.org/10.18653/v1/2025.nlperspectives-1.7),ISBN 979\-8\-89176\-350\-0Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- D\. Hovy and S\. Prabhumoye \(2021\)Five sources of bias in natural language processing\.Language and linguistics compass15\(8\),pp\. e12432\.Cited by:[§3\.4](https://arxiv.org/html/2606.02911#S3.SS4.p1.1)\.
- L\. Jiang, T\. Sorensen, S\. Levine, and Y\. Choi \(2025\)Can language models reason about individualistic human values and preferences?\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 6757–6794\.External Links:[Link](https://aclanthology.org/2025.acl-long.336/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.336),ISBN 979\-8\-89176\-251\-0Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- J\. Kocoń, A\. Figas, M\. Gruza, D\. Puchalska, T\. Kajdanowicz, and P\. Kazienko \(2021a\)Offensive, aggressive, and hate speech analysis: from data\-centric to human\-centered approach\.Information Processing & Management58\(5\),pp\. 102643\.External Links:ISSN 0306\-4573,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ipm.2021.102643),[Link](https://www.sciencedirect.com/science/article/pii/S0306457321001333)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p1.1)\.
- J\. Kocoń, M\. Gruza, J\. Bielaniewicz, D\. Grimling, K\. Kanclerz, P\. Miłkowski, and P\. Kazienko \(2021b\)Learning personal human biases and representations for subjective tasks in natural language processing\.In2021 IEEE International Conference on Data Mining \(ICDM\),Vol\.,pp\. 1168–1173\.External Links:[Document](https://dx.doi.org/10.1109/ICDM51629.2021.00140)Cited by:[§3\.2](https://arxiv.org/html/2606.02911#S3.SS2.p1.4)\.
- J\. Lan, D\. Frassinelli, and B\. Plank \(2025\)Mind the uncertainty in human disagreement: evaluating discrepancies between model predictions and human responses in vqa\.Proceedings of the AAAI Conference on Artificial Intelligence39\(4\),pp\. 4446–4454\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/32468),[Document](https://dx.doi.org/10.1609/aaai.v39i4.32468)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- E\. Leonardelli, S\. Casola, S\. Peng, G\. Rizzi, V\. Basile, E\. Fersini, D\. Frassinelli, H\. Jang, M\. Pavlovic, B\. Plank, and M\. Poesio \(2025\)LeWiDi\-2025 at NLPerspectives: the third edition of the learning with disagreements shared task\.InProceedings of the The 4th Workshop on Perspectivist Approaches to NLP,G\. Abercrombie, V\. Basile, S\. Frenda, S\. Tonelli, and S\. Dudy \(Eds\.\),Suzhou, China,pp\. 182–195\.External Links:[Link](https://aclanthology.org/2025.nlperspectives-1.16/),[Document](https://dx.doi.org/10.18653/v1/2025.nlperspectives-1.16),ISBN 979\-8\-89176\-350\-0Cited by:[§3\.2](https://arxiv.org/html/2606.02911#S3.SS2.p1.4)\.
- E\. Leonardelli, S\. Menini, A\. Palmero Aprosio, M\. Guerini, S\. Tonelli,et al\.\(2021\)Agreeing to disagree: annotating offensive language datasets with annotators’ disagreement\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 10528–10539\.Cited by:[§3\.4](https://arxiv.org/html/2606.02911#S3.SS4.SSS0.Px1.p1.4)\.
- H\. Li, Q\. Dong, J\. Chen, H\. Su, Y\. Zhou, Q\. Ai, Z\. Ye, and Y\. Liu \(2024\)Llms\-as\-judges: a comprehensive survey on llm\-based evaluation methods\.arXiv preprint arXiv:2412\.05579\.Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- S\. M\. Lo, O\. Araque, R\. Sharma, and M\. A\. Stranisci \(2025a\)That is unacceptable: the moral foundations of canceling\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 6625–6639\.External Links:[Link](https://aclanthology.org/2025.acl-long.330/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.330),ISBN 979\-8\-89176\-251\-0Cited by:[§3\.4](https://arxiv.org/html/2606.02911#S3.SS4.p3.3),[Table 1](https://arxiv.org/html/2606.02911#S3.T1.7.3.1.2.1.2.1)\.
- S\. M\. Lo, S\. Casola, E\. Sezerer, V\. Basile, F\. Sansonetti, A\. Uva, and D\. Bernardi \(2025b\)PERSEVAL: a framework for perspectivist classification evaluation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 22334–22359\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1137/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1137),ISBN 979\-8\-89176\-332\-6Cited by:[§3\.2](https://arxiv.org/html/2606.02911#S3.SS2.p1.4)\.
- A\. Luccioni and J\. Viviano \(2021\)What’s in the box? an analysis of undesirable content in the common crawl corpus\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 2: Short Papers\),pp\. 182–189\.Cited by:[§4\.2](https://arxiv.org/html/2606.02911#S4.SS2.SSS0.Px3.p1.1)\.
- P\. Maes \(1994\)Agents that reduce work and information overload\.Commun\. ACM37\(7\),pp\. 30–40\.External Links:ISSN 0001\-0782,[Link](https://doi.org/10.1145/176789.176792),[Document](https://dx.doi.org/10.1145/176789.176792)Cited by:[§3\.3](https://arxiv.org/html/2606.02911#S3.SS3.p1.1)\.
- W\. Mieleszczenko\-Kowszewicz, K\. Kanclerz, J\. Bielaniewicz, M\. Oleksy, M\. Gruza, S\. Wozniak, E\. Dzieciol, P\. Kazienko, and J\. Kocon \(2023\)Capturing human perspectives in nlp: questionnaires, annotations, and biases\.\.InNLPerspectives@ ECAI,Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1),[§3\.2](https://arxiv.org/html/2606.02911#S3.SS2.p1.4)\.
- N\. Mokhberian, M\. Marmarelis, F\. Hopp, V\. Basile, F\. Morstatter, and K\. Lerman \(2024\)Capturing perspectives of crowdsourced annotators in subjective learning tasks\.Innaacl:2024:long,K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 7337–7349\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2024.naacl-long.407/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.407)Cited by:[§3\.2](https://arxiv.org/html/2606.02911#S3.SS2.p1.4)\.
- A\. Mostafazadeh Davani, M\. Diaz, D\. K\. Baker, and V\. Prabhakaran \(2024a\)D3CODE: disentangling disagreements in data across cultures on offensiveness detection and evaluation\.Inemnlp:2024:main,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 18511–18526\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2024.emnlp-main.1029/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1029)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- A\. Mostafazadeh Davani, M\. Díaz, D\. Baker, and V\. Prabhakaran \(2024b\)Disentangling perceptions of offensiveness: cultural and moral correlates\.InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency,pp\. 2007–2021\.Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p2.1),[§3\.4](https://arxiv.org/html/2606.02911#S3.SS4.p4.3),[Table 1](https://arxiv.org/html/2606.02911#S3.T1.7.4.1.2.1.2.1)\.
- A\. Mostafazadeh Davani, M\. Díaz, and V\. Prabhakaran \(2022\)Dealing with disagreements: looking beyond the majority vote in subjective annotations\.Transactions of the Association for Computational Linguistics10,pp\. 92–110\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2022.tacl-1.6/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00449)Cited by:[§3\.2](https://arxiv.org/html/2606.02911#S3.SS2.p1.4)\.
- M\. Orlikowski, J\. Pei, P\. Röttger, P\. Cimiano, D\. Jurgens, and D\. Hovy \(2025\)Beyond demographics: fine\-tuning large language models to predict individuals’ subjective text perceptions\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 2092–2111\.External Links:[Link](https://aclanthology.org/2025.acl-long.104/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.104),ISBN 979\-8\-89176\-251\-0Cited by:[§3\.2](https://arxiv.org/html/2606.02911#S3.SS2.p1.4)\.
- M\. Pavlovic and M\. Poesio \(2024a\)The effectiveness of LLMs as annotators: a comparative overview and empirical analysis of direct representation\.Innlperspectives:2024:1,G\. Abercrombie, V\. Basile, D\. Bernadi, S\. Dudy, S\. Frenda, L\. Havens, and S\. Tonelli \(Eds\.\),Torino, Italia,pp\. 100–110\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2024.nlperspectives-1.11/)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- M\. Pavlovic and M\. Poesio \(2024b\)Understanding the effect of temperature on alignment with human opinions\.Proceedings of Algorithmic Fairness through the lens of Metrics and Evaluation Workshop\.Cited by:[footnote 6](https://arxiv.org/html/2606.02911#footnote6)\.
- B\. Plank \(2022\)The “problem” of human label variation: on ground truth in data, modeling and evaluation\.Inemnlp:2022:main,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 10671–10682\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2022.emnlp-main.731/),[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.731)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p1.1)\.
- A\. S\. Rao, A\. Yerukola, V\. Shah, K\. Reinecke, and M\. Sap \(2025\)NormAd: a framework for measuring the cultural adaptability of large language models\.Innaacl:2025:long,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 2373–2403\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2025.naacl-long.120/),ISBN 979\-8\-89176\-189\-6Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- F\. Ricci, L\. Rokach, and B\. Shapira \(2022\)Recommender systems: techniques, applications, and challenges\.InRecommender Systems Handbook,F\. Ricci, L\. Rokach, and B\. Shapira \(Eds\.\),pp\. 1–35\.External Links:[Link](https://doi.org/10.1007/978-1-0716-2197-4%5C_1),[Document](https://dx.doi.org/10.1007/978-1-0716-2197-4%5F1)Cited by:[§3\.3](https://arxiv.org/html/2606.02911#S3.SS3.p1.1)\.
- P\. Sachdeva, R\. Barreto, G\. Bacon, A\. Sahn, C\. von Vacano, and C\. Kennedy \(2022\)The measuring hate speech corpus: leveraging rasch measurement theory for data perspectivism\.Innlperspectives:2022:1,G\. Abercrombie, V\. Basile, S\. Tonelli, V\. Rieser, and A\. Uma \(Eds\.\),Marseille, France,pp\. 83–94\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2022.nlperspectives-1.11/)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p2.1),[§3\.4](https://arxiv.org/html/2606.02911#S3.SS4.p5.3),[Table 1](https://arxiv.org/html/2606.02911#S3.T1.7.5.1.2.1.2.1)\.
- M\. Sandri, E\. Leonardelli, S\. Tonelli, and E\. Jezek \(2023\)Why don‘t you do it right? analysing annotators’ disagreement in subjective tasks\.Ineacl:2023:main,A\. Vlachos and I\. Augenstein \(Eds\.\),Dubrovnik, Croatia,pp\. 2428–2441\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2023.eacl-main.178/),[Document](https://dx.doi.org/10.18653/v1/2023.eacl-main.178)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- M\. Sap, S\. Swayamdipta, L\. Vianna, X\. Zhou, Y\. Choi, and N\. A\. Smith \(2022a\)Annotators with attitudes: how annotator beliefs and identities bias toxic language detection\.Innaacl:2022:main,M\. Carpuat, M\. de Marneffe, and I\. V\. Meza Ruiz \(Eds\.\),Seattle, United States,pp\. 5884–5906\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2022.naacl-main.431/),[Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.431)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- M\. Sap, S\. Swayamdipta, L\. Vianna, X\. Zhou, Y\. Choi, and N\. A\. Smith \(2022b\)Annotators with attitudes: how annotator beliefs and identities bias toxic language detection\.InProceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies,pp\. 5884–5906\.Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p1.1),[§3\.4](https://arxiv.org/html/2606.02911#S3.SS4.p2.3),[Table 1](https://arxiv.org/html/2606.02911#S3.T1.7.2.1.2.1.2.1)\.
- O\. O\. Sarumi, C\. Welch, D\. Braun, and J\. Schlötterer \(2025\)The impact of annotator personas on LLM behavior across the perspectivism spectrum\.InProceedings of the 8th International Conference on Natural Language and Speech Processing \(ICNLSP\-2025\),M\. Abbas, T\. Yousef, and L\. Galke \(Eds\.\),Southern Denmark University, Odense, Denmark,pp\. 121–136\.External Links:[Link](https://aclanthology.org/2025.icnlsp-1.14/)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1),[footnote 6](https://arxiv.org/html/2606.02911#footnote6)\.
- J\. B\. Schafer, D\. Frankowski, J\. Herlocker, and S\. Sen \(2007\)Collaborative filtering recommender systems\.InThe Adaptive Web: Methods and Strategies of Web Personalization,pp\. 291–324\.External Links:ISBN 978\-3\-540\-72079\-9,[Document](https://dx.doi.org/10.1007/978-3-540-72079-9%5F9),[Link](https://doi.org/10.1007/978-3-540-72079-9_9)Cited by:[§3\.3](https://arxiv.org/html/2606.02911#S3.SS3.p1.1)\.
- W\. S\. Schmeisser\-Nieto, P\. Pastells, S\. Frenda, and M\. Taule \(2024\)Human vs\. machine perceptions on immigration stereotypes\.Inlrec:2024:main,N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 8453–8463\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2024.lrec-main.741/)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p7.1),[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- H\. Sheng, X\. Liu, H\. He, J\. Zhao, and J\. Kang \(2025\)Analyzing uncertainty of LLM\-as\-a\-judge: interval evaluations with conformal prediction\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 11286–11328\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.569/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.569),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p3.1)\.
- M\. A\. Stranisci and C\. Hardmeier \(2026\)What are they filtering out? an experimental benchmark of filtering strategies for harm reduction in pretraining datasets\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 39303–39313\.Cited by:[§4\.2](https://arxiv.org/html/2606.02911#S4.SS2.SSS0.Px3.p1.1)\.
- Z\. Tan, D\. Li, S\. Wang, A\. Beigi, B\. Jiang, A\. Bhattacharjee, M\. Karami, J\. Li, L\. Cheng, and H\. Liu \(2024\)Large language models for data annotation and synthesis: a survey\.Inemnlp:2024:main,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 930–957\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2024.emnlp-main.54/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.54)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p2.1),[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- A\. N\. Uma, T\. Fornaciari, D\. Hovy, S\. Paun, B\. Plank, and M\. Poesio \(2021\)Learning from disagreement: a survey\.Journal of Artificial Intelligence Research72,pp\. 1385–1470\.Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p1.1)\.
- A\. Urbinati, M\. Lai, S\. Frenda, and M\. Stranisci \(2025\)Are you sure? measuring models bias in content moderation through uncertainty\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 18061–18076\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.980/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.980),ISBN 979\-8\-89176\-335\-7Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p3.1)\.
- M\. van der Meer, N\. Falk, P\. K\. Murukannaiah, and E\. Liscio \(2024\)Annotator\-centric active learning for subjective NLP tasks\.Inemnlp:2024:main,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 18537–18555\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2024.emnlp-main.1031/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1031)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- M\. Van Der Meer, N\. Falk, P\. Murukannaiah, and E\. Liscio \(2024\)Annotator\-centric active learning for subjective nlp tasks\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 18537–18555\.Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p2.1)\.
- B\. Vidgen and L\. Derczynski \(2020\)Directions in abusive language training data, a systematic review: garbage in, garbage out\.Plos one15\(12\),pp\. e0243300\.Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p2.1)\.
- G\. Villate\-Castillo, J\. Del Ser, and B\. Sanz \(2025\)A collaborative content moderation framework for toxicity detection based on multitask neural networks and conformal estimates of annotation disagreement\.Neurocomputing647,pp\. 130542\.External Links:ISSN 0925\-2312,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.neucom.2025.130542),[Link](https://www.sciencedirect.com/science/article/pii/S0925231225012147)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p3.1)\.
- N\. Vitsakis, A\. Parekh, and I\. Konstas \(2024\)Voices in a crowd: searching for clusters of unique perspectives\.Inemnlp:2024:main,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 12517–12539\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2024.emnlp-main.696/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.696)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p2.1)\.
- R\. Wan, H\. Wang, T\. K\. Huang, and J\. Gao \(2025\)From noise to nuance: enriching subjective data annotation through qualitative analysis\.InProceedings of the Fourth Workshop on Bridging Human\-Computer Interaction and Natural Language Processing \(HCI\+NLP\),S\. L\. Blodgett, A\. C\. Curry, S\. Dev, S\. Li, M\. Madaio, J\. Wang, S\. T\. Wu, Z\. Xiao, and D\. Yang \(Eds\.\),Suzhou, China,pp\. 240–254\.External Links:[Link](https://aclanthology.org/2025.hcinlp-1.20/),[Document](https://dx.doi.org/10.18653/v1/2025.hcinlp-1.20),ISBN 979\-8\-89176\-353\-1Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- S\. Wang, Y\. Jiang, Y\. Tang, L\. Cheng, and H\. Chen \(2025\)Copu: conformal prediction for uncertainty quantification in natural language generation\.arXiv preprint arXiv:2502\.12601\.Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p3.1)\.
- X\. Wang and B\. Plank \(2023\)ACTOR: active learning with annotator\-specific classification heads to embrace human label variation\.Inemnlp:2023:main,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 2046–2052\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2023.emnlp-main.126/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.126)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p1.1)\.
- M\. Wich, C\. Widmer, G\. Hagerer, and G\. Groh \(2021\)Investigating annotator bias in abusive language datasets\.Inranlp:2021:1,R\. Mitkov and G\. Angelova \(Eds\.\),Held Online,pp\. 1515–1525\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2021.ranlp-1.170/)Cited by:[§1](https://arxiv.org/html/2606.02911#S1.p2.1)\.
- D\. Wright, A\. Arora, N\. Borenstein, S\. Yadav, S\. Belongie, and I\. Augenstein \(2024\)LLM tropes: revealing fine\-grained values and opinions in large language models\.Infindings:2024:emnlp,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 17085–17112\.External Links:[Link](https://arxiv.org/html/2606.02911v1/anth2024.findings-emnlp.995/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.995)Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p2.1)\.
- C\. Zerva and A\. F\. Martins \(2024\)Conformalizing machine translation evaluation\.Transactions of the Association for Computational Linguistics12,pp\. 1460–1478\.Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p3.1)\.
- X\. Zhan, Q\. Xu, Y\. Zheng, G\. Lu, and O\. Gevaert \(2023\)Reliability\-based cleaning of noisy training labels with inductive conformal prediction in multi\-modal biomedical data mining\.arXiv preprint arXiv:2309\.07332\.Cited by:[§2](https://arxiv.org/html/2606.02911#S2.p3.1)\.

## Appendix A Mathematical Formulation

#### Non Conformity Score\.

For a single annotated text $t$ and a set of possible labels $\mathcal{Y}$, the Brier Score $b(t,\mathcal{Y})$ can be written as

$$b(t,\mathcal{Y})=\frac{1}{|\mathcal{Y}|}\sum_{y\in\mathcal{Y}}\left(o_{y}(t)-p_{M}(y\mid t)\right)^{2} \tag{2}$$

where:

- $o_{y}(t)$ is the binary indicator \(1 if the true label is $y$, else 0\)\.
- $p_{M}(y\mid t)$ is the model\-predicted probability for label $y$\.

The Brier Score is directly used as a single conformity score to quantify the alignment of model predictions with observed outcomes\. A lower score indicates better conformity, reflecting predictions that are less uncertain and better calibrated\.

Exploiting the Brier Score, we compute the Non Conformity Score \(NCS\), which measures the variability in the model’s confidence when predictions are compared with a human annotation $y$\. The NCS penalizes the model both for assigning low probability to the reference label $y$ and for placing the remaining probability on a single wrong label rather than spreading it across several\. A confident mistake is therefore penalized more than an uncertain one\.

$$\mathrm{NCS}(t,y)=\frac{1}{|\mathcal{Y}|}\left[(1-p_{M}(y\mid t))^{2}+\sum_{y^{\prime}\neq y}p_{M}(y^{\prime}\mid t)^{2}\right] \tag{3}$$
Lower NCS values indicate higher alignment between the model and the human label, whereas higher values correspond to more uncertain or less concentrated predictive distributions\.
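The behavior of Eq. (3) can be checked on a small example. The following is a minimal sketch, not the authors' implementation: the function name `ncs` and the three probability distributions are made up for illustration, and the labels are the five violence options 0 to 4.

```python
from typing import Hashable, Mapping


def ncs(probs: Mapping[Hashable, float], reference: Hashable) -> float:
    """Non Conformity Score of Eq. (3) for one text and one reference label.

    `probs` maps every label in the label set to the model probability;
    `reference` is the human annotation the prediction is compared with.
    """
    if reference not in probs:
        raise KeyError(f"reference label {reference!r} is not in the label set")
    # Penalty for the probability mass missing from the reference label.
    miss = (1.0 - probs[reference]) ** 2
    # Penalty for mass on other labels; largest when it sits on a single label.
    spread = sum(p ** 2 for label, p in probs.items() if label != reference)
    return (miss + spread) / len(probs)


# Illustrative (made-up) distributions; the human label is 0 in all cases.
confident_right = {0: 0.92, 1: 0.02, 2: 0.02, 3: 0.02, 4: 0.02}
uncertain = {0: 0.20, 1: 0.20, 2: 0.20, 3: 0.20, 4: 0.20}
confident_wrong = {0: 0.02, 1: 0.02, 2: 0.02, 3: 0.02, 4: 0.92}

ncs_confident_right = ncs(confident_right, reference=0)
ncs_uncertain = ncs(uncertain, reference=0)
ncs_confident_wrong = ncs(confident_wrong, reference=0)
```

With these inputs the confident mistake receives the highest score and the confident correct prediction the lowest, with the uniform prediction in between, which matches the ordering described in the text.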

To understand the general behavior of the model, we also compute $NCS_{avg}(M)$ as the average across all annotators and texts:

$$NCS_{avg}(M)=\frac{\sum_{t,i}b(t,\mathcal{Y}_{a_{i}})-b(t,\mathcal{Y}_{M})}{T} \tag{4}$$

where $T$ is the total number of annotations, considering all texts and all annotators\.

#### Ghost Annotator Vector\.

Given the $NCS_{avg}(M)$ definition, we can define the Ghost Annotator Vector as:

$$\vec{NCS}=\left\{Q_{p}\!\left(b(k,\mathcal{Y}_{a_{i}})-b(k,\mathcal{Y}_{M})\right)\right\}_{p} \tag{5}$$

where $Q_{1}$, $Q_{2}$, $Q_{3}$ denote the first, second, and third quartiles of the NCS distribution for model $m$\.

$$\vec{G_{m}}=\left(Q_{1}(NCS_{m}),\;Q_{2}(NCS_{m}),\;Q_{3}(NCS_{m})\right) \tag{6}$$

Different distributions can be defined, for instance by selecting, as described in Section[4](https://arxiv.org/html/2606.02911#S4), the annotators whose $\vec{G_{m}}$ distributions most closely match that of the ghost annotator\.
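The quartile vector of Eq. (6) and its comparison across annotators can be sketched as follows. This is an illustration under assumptions: the score lists are made up, the helper names `quartile_vector` and `cosine_similarity` are hypothetical, the quartiles use NumPy's default linear interpolation, and cosine similarity is taken as the comparison measure because the abstract mentions it.

```python
import numpy as np


def quartile_vector(scores) -> np.ndarray:
    """Return (Q1, Q2, Q3) of a collection of NCS values, as in Eq. (6)."""
    return np.percentile(np.asarray(scores, dtype=float), [25, 50, 75])


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up NCS values for a model and for one human annotator.
model_scores = [0.05, 0.10, 0.15, 0.20, 0.25]
annotator_scores = [0.10, 0.20, 0.30, 0.40, 0.50]

g_model = quartile_vector(model_scores)
g_annotator = quartile_vector(annotator_scores)
similarity = cosine_similarity(g_model, g_annotator)
```

Because cosine similarity ignores vector length, two quartile vectors with the same shape but different magnitude (as in this toy case) are treated as maximally similar, so magnitudes would need to be inspected separately.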

## Appendix B Prompt and Prediction Extraction

The goal of the prompt is to classify social media posts according to specific content\-related categories, with a clear scoring system\. \(Footnote 6: Unlike previous work by Pavlovic and Poesio \([2024b](https://arxiv.org/html/2606.02911#bib.bib79)\) and Sarumi et al\. \([2025](https://arxiv.org/html/2606.02911#bib.bib78)\), we do not tune the models to exhibit behavior similar to humans\.\)

Each dataset used in the experiment is associated with a distinct prompt, where the user is instructed to annotate a social media post with one of the predefined values from the following options:

- Violence: 0 \(no violence\) to 4 \(extremely violent\)
- Hate Speech: 1 \(no hate speech\) to 5 \(extremely hate speech\)
- Unacceptability: 1 \(acceptable\) to 4 \(extremely unacceptable\)
- Offensiveness: 0 \(no offensiveness\) to 4 \(extremely offensive\)

The prompt structure follows a similar format across all datasets, which ensures consistency in the model’s task understanding and response generation\. An example prompt for violence classification is as follows:

Task: You are a participant in an annotation task for the recognition of violence\.

Instruction: Read the following social media post and annotate it with one value from the following options\. Options: 0, 1, 2, 3, 4\. 0 == no violence, 4 == extremely violent\.

Output format: The answer should follow this template ’answer’: option\. Answer only in JSON\. No extra text\.

Question: How much does the following social media post contain violence?

The trans women reading this tweet right now is beautiful

To extract the predictions, we used the transformers library from Hugging Face\. First, the text input, which includes both the prompt and the social media post, was tokenized with the model’s corresponding tokenizer\. The tokenized input was then passed through the model to generate predictions\. The model outputs raw logits, which are unnormalized scores over all possible tokens\. These logits were passed through a log\-softmax function, which normalizes them into log\-probabilities whose exponentials form a probability distribution summing to one over the vocabulary\.

The probabilities for the target labels \(e\.g\., ’0’, ’1’, ’2’, ’3’, ’4’\) were gathered across the generated tokens\. These probabilities were averaged over multiple steps of token generation to provide a more robust prediction\.
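The extraction step can be illustrated with plain NumPy on toy logits. This is only a sketch under assumptions: the six-token vocabulary, the token ids in `label_token_ids`, and the helper names are hypothetical, and the real pipeline would take the per-step logits from the model's generation output. The sketch does not renormalize over the label tokens, since the text does not say whether that is done.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def label_probabilities(step_logits, label_token_ids: dict) -> dict:
    """Average, over generation steps, the probability of each label token."""
    logits = np.asarray(step_logits, dtype=float)   # shape: (steps, vocab)
    probs = np.exp(log_softmax(logits))             # each row sums to one
    ids = list(label_token_ids.values())
    averaged = probs[:, ids].mean(axis=0)           # mean over the steps
    return dict(zip(label_token_ids.keys(), averaged.tolist()))


# Hypothetical mapping from label strings to token ids in a toy vocabulary.
label_token_ids = {"0": 1, "1": 2, "2": 3, "3": 4, "4": 5}

# Two generation steps: a flat step, then a step that favors label "0".
step_logits = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, np.log(3.0), 0.0, 0.0, 0.0, 0.0],
]

label_probs = label_probabilities(step_logits, label_token_ids)
```

Averaging over steps smooths out cases where the label token is not emitted at a fixed position, at the cost of mixing in steps where the model is producing formatting tokens rather than the label itself.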

## Appendix C Hardware and Experimental Setup

Each experimental run was allocated a single compute node with the following specifications: 4 CPU cores, 25GB of RAM, and one NVIDIA H200 GPU\. The experiment ran for 40 hours\. Models were always initialized with their default hyperparameters\.

## Appendix D Intersectional Groups of Annotators in DAVANI corpus

| Age | Gender | Region | Count |
| --- | --- | --- | --- |
| 18–30 | Man | Sub Saharan Africa | 169 |
| 18–30 | Woman | Western Europe | 167 |
| 18–30 | Woman | Latin America | 162 |
| 18–30 | Woman | North America | 161 |
| 18–30 | Man | Arab Culture | 154 |
| 18–30 | Woman | Sub Saharan Africa | 149 |
| 18–30 | Man | Latin America | 137 |
| 30–50 | Woman | Oceania | 133 |
| 18–30 | Woman | Indian Cultural Sphere | 132 |
| 30–50 | Man | Indian Cultural Sphere | 123 |
| 30–50 | Woman | Sinosphere | 119 |
| 18–30 | Woman | Oceania | 114 |
| 18–30 | Woman | Arab Culture | 110 |

Table 3: Intersectional Groups in DAVANI corpus\.
