Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation

arXiv cs.CL Papers

Summary

This paper presents a large-scale analysis of four harmful language detection datasets, examining how annotator characteristics and linguistic features interact to influence annotation variation. It highlights intersectional effects and warns against generalizing findings across different datasets.

arXiv:2605.06318v1 Announce Type: new Abstract: Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced. Data collection practices thus shifted towards increasing the annotator numbers and releasing disaggregated datasets, harmful language being most resourced due to its high subjectivity. While this resulted in rich information about *who* annotated (sociodemographics, attitudes, etc.), the *what* (e.g., linguistic properties of items), and their interplay has received little attention. We present the first large-scale analysis of four reference datasets for harmful language detection, bringing together annotator characteristics, linguistic properties of the items, and their interactions in a statistically informed picture. We find that interactions are crucial, revealing intersectional effects ignored in previous work, and that a strong role is played by lexical cues and annotator attitudes. Effect patterns, however, vary considerably across datasets. This urges caution about generalization and transferability.

Cached at: 05/08/26, 07:42 AM

# Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation
Source: [https://arxiv.org/html/2605.06318](https://arxiv.org/html/2605.06318)
Gabriella Lapesa \(1, 2\); \(1\) GESIS \- Leibniz Institute for the Social Sciences; \(2\) Heinrich\-Heine University Düsseldorf; first\.last@gesis\.org \(1\)

###### Abstract

Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced\. Data collection practices thus shifted towards increasing the annotator numbers and releasing disaggregated datasets, harmful language being most resourced due to its high subjectivity\. While this resulted in rich information about *who* annotated \(sociodemographics, attitudes, etc\.\), the *what* \(e\.g\., linguistic properties of items\), and their interplay has received little attention\. We present the first large\-scale analysis of four reference datasets for harmful language detection, bringing together annotator characteristics, linguistic properties of the items, and their interactions in a statistically informed picture\. We find that interactions are crucial, revealing intersectional effects ignored in previous work, and that a strong role is played by lexical cues and annotator attitudes\. Effect patterns, however, vary considerably across datasets\. This urges caution about generalization and transferability\. \[Footnote 1: The code of our analyses is available at [https://anonymous\.4open\.science/r/who\_and\_what\-F1C7](https://anonymous.4open.science/r/who_and_what-F1C7)\.\]

Disclaimer: This paper contains examples of vulgar expressions and hateful text items\.


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.06318v1/x1.png)

Figure 1: Cross\-classified data structure for ordinal text annotations: Each annotation belongs to one unique annotator/item combination\. Each item is part of a batch\. We focus on the structure within the dashed box\.

In recent years, calls for considering annotation disagreement \(Basile et al\., [2021a](https://arxiv.org/html/2605.06318#bib.bib64)\) and embracing annotation variation \(data perspectivism, Cabitza et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib42)\) have led to the introduction of disaggregated corpora including individual annotators’ labeling decisions, and modeling approaches accounting for this variation beyond aggregating them into a single gold label \(Davani et al\., [2022](https://arxiv.org/html/2605.06318#bib.bib76); Weerasooriya et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib84), among others\)\. In tasks where the annotator characteristics are of particular interest, perspectivist works reveal a mixed picture: While some works find that including annotator sociodemographics improves the modeling of annotation variation \(Kocoń et al\., [2021](https://arxiv.org/html/2605.06318#bib.bib61); Wan et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib44); Tahaei and Bergler, [2024](https://arxiv.org/html/2605.06318#bib.bib86)\), others do not find convincing evidence that it does \(Orlikowski et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib78); Sun et al\., [2025](https://arxiv.org/html/2605.06318#bib.bib83)\)\.

We argue that this discrepancy and the design of existing works reveal gaps in addressing the main underlying question: *Who* differs in their perception of *what* \(cf\. Sap et al\., [2022](https://arxiv.org/html/2605.06318#bib.bib82)\)?

Firstly, the assessment of this question is impacted by the structure of how annotations are usually conducted: As exemplified in Figure [1](https://arxiv.org/html/2605.06318#S1.F1), annotations follow a cross\-classified structure, meaning that annotations are simultaneously grouped by items and annotators \(two non\-nested factors\): each item is annotated by multiple annotators, and each annotator annotates multiple items\. One can reasonably assume systematic variation on both levels\. Most works assessing annotation variation, however, do not account for this, resulting in limited generalizability and comparability of the findings\.

Secondly, while analysis approaches such as the ones mentioned above provide a starting point for the community to address the who, the what remains largely ignored\. This gap is particularly noteworthy given that annotation is an interactive process between text items of varying linguistic composition and annotators, whose identities are more complex than individual sociodemographic proxies such as gender \(cf\. Orlikowski et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib78)\)\. This may lead to variations between annotator groups, but only for certain text items\.

Given that it is unclear whether annotator\- and item\-level characteristics are predictive of annotation behavior, we propose to analyze disaggregated data in a principled way to reveal factors of interest when dealing with annotation variation in subjective tasks\. We argue that this is a vital step before training or testing any language model\-based system, as it provides important pointers to the information we might want to include in such a system and what model behavior to expect\.

In this work, we thus conduct the largest annotation variation analysis to date, spanning four disaggregated harmful communication datasets, containing a total of \>25k items, \>8k annotators, and \>205k annotations\. We take both annotator characteristics \(up to 19\) and item features \(over 300\) into account, as well as interactions among annotator features and between the annotator and item features\. We use Bayesian multilevel regression models to find the most impactful and relevant out of up to 5,264 fixed effects\. We account for the partially cross\-classified structure by including random intercepts for annotators and items\.

On the annotator side, we take into account the available sociodemographics \(e\.g\., age\) and attitudes \(e\.g\., whether annotators think hate speech is a problem\), as well as two\-way interactions between them\. This allows for assessing intersectional effects\.

On the item side, we look at domain\-specific lexical signals, as well as a broad set of general characteristics, ranging from morphosyntactical to psycholinguistic features\. To assess who differs in their perception of what, we include interactions between the annotator and item features\.

We conduct exploratory analyses in three realistic scenarios \[Footnote 2: While these scenarios are realistic, they were partially motivated by restrictions due to computational resources and limits of existing implementations \(see App\. [L](https://arxiv.org/html/2605.06318#A12) & [M](https://arxiv.org/html/2605.06318#A13)\)\.\]: \(i\) Comparing effects in related tasks with different conceptualizations, annotation guidelines, items, and annotators\. This aims at finding potentially more general effects in related phenomena \(Section [5](https://arxiv.org/html/2605.06318#S5)\)\. \(ii\) Comparing demographically similar annotator groups: two annotator groups with very similar distributions of annotator characteristics, annotating the same items with the same conceptualization and annotation guidelines\. This can be viewed as a simulation of collecting more annotators for items for which one already has annotations \(Section [6](https://arxiv.org/html/2605.06318#S6)\)\. \(iii\) Comparing different sets of batches\. This can be viewed as datasets collected with different annotators for different items but using the same conceptualization and annotation guidelines \(Section [7](https://arxiv.org/html/2605.06318#S7)\)\.

Our contributions are two\-fold: \(1\) On a methodological level, we conduct a principled in\-depth analysis of large disaggregated datasets\. We discuss relevant questions, assumptions, and decisions at each step of our analyses\. In doing so, we hope to contribute to establishing best practices in the field when analyzing such datasets\. \(2\) On a substantive level, we provide, to the best of our knowledge, the first assessment of annotation behavior from a linguistic, annotator\-item interaction, and intersectional perspective for harmful language datasets\.

Answering the question "who annotates what" is relevant for multiple NLP research communities: the harmful language detection community benefits from the analysis of these reference datasets and may find more insights in the effect patterns we discovered\. The human label variation community may benefit from considering both the who and the what, for disentangling variation, but also for targeted \(re\-\)annotation\. Finally, the modeling/content moderation community benefits from insights into differing tendencies for specific item\-annotator combinations, and from identifying potential spurious confounders\.

## 2 Related Work

Besides annotation errors, taxonomies \(Basile et al\., [2021b](https://arxiv.org/html/2605.06318#bib.bib41); Uma et al\., [2021](https://arxiv.org/html/2605.06318#bib.bib47); Zhang et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib45)\) have identified three high\-level sources of annotation variation: reasons stemming from the annotator, the items, and the annotation guidelines and settings\.

On the annotators’ side, subjectivity and individual differences in sociodemographic backgrounds and attitudes have received particular attention\. While some works find significant impacts of the country of residence \(Lee et al\., [2024](https://arxiv.org/html/2605.06318#bib.bib87)\), race \(Larimore et al\., [2021](https://arxiv.org/html/2605.06318#bib.bib71)\), gender, and age \(Pei and Jurgens, [2023](https://arxiv.org/html/2605.06318#bib.bib79)\), others do not find such differences \(Biester et al\., [2022](https://arxiv.org/html/2605.06318#bib.bib88); Sap et al\., [2022](https://arxiv.org/html/2605.06318#bib.bib82)\)\. Modeling approaches including sociodemographic information reflect this mixed picture \(Wan et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib44); Orlikowski et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib78); Tahaei and Bergler, [2024](https://arxiv.org/html/2605.06318#bib.bib86); Beck et al\., [2024b](https://arxiv.org/html/2605.06318#bib.bib66); Orlikowski et al\., [2025](https://arxiv.org/html/2605.06318#bib.bib77); Sun et al\., [2025](https://arxiv.org/html/2605.06318#bib.bib83)\)\. Homan et al\. \([2024](https://arxiv.org/html/2605.06318#bib.bib69)\) argue that identities are more complex than individual demographic characteristics, so they investigate intersectional effects, and find differences between intersectional groups, particularly for race and gender\.

In contrast to the potential reasons for annotation variation on the annotators’ side, work on reasons on the item side remains largely theoretical\. A noteworthy exception is Rizzi et al\. \([2025](https://arxiv.org/html/2605.06318#bib.bib80)\), who find that for hate speech, certain lexical items are indicative of disagreement in harmful language\. Theoretical taxonomies \(Uma et al\., [2021](https://arxiv.org/html/2605.06318#bib.bib47); Basile et al\., [2021b](https://arxiv.org/html/2605.06318#bib.bib41)\) name item difficulty, \(missing\) context, and linguistic ambiguities on all levels, as well as uncommon words, code switching, and neologisms as potential reasons for annotation variation\. They do not, however, conduct extensive analyses or consider interactions\. For online toxicity, Zhang et al\. \([2023](https://arxiv.org/html/2605.06318#bib.bib45)\) add domain\-specific reasons such as obfuscated racism\. Moreover, they hint at reasons due to an interaction between annotator\- and item\-level factors, such as varying sensitivity to lexical signals\. Sap et al\. \([2022](https://arxiv.org/html/2605.06318#bib.bib82)\) draw further attention to factors relating to both annotators and items\. They analyze the impact of demographics and attitudes on the annotator side, and dialect differences on the item side\. They find that more conservative annotators rate anti\-Black statements lower in toxicity compared to other annotator groups, but African American English items higher\. Similarly, Larimore et al\. \([2021](https://arxiv.org/html/2605.06318#bib.bib71)\) find interactions between annotator race and racially charged lexical signals such as the n\-word, or charged topics such as police brutality\.

Table 1: Comparative overview of the analyses done in related work vs\. ours: do they consider annotator characteristics and interpretable item features? For regression\-based experiments, do they account for variation of annotators and items with random intercepts?

##### Research Gap

Drawing on these findings, in this work, we view annotation as an interactive process between annotators and items\. As such, we do not only include item\-level and annotator\-level features but also consider interactions between the two levels\. In contrast to related studies, we take the partially cross\-classified data structure into account\. On the *what*\-level, we include both characteristics that can be assumed to be indicative of the phenomenon \(e\.g\., lexical signals idiosyncratic to the phenomenon\) and broader phenomenon\-independent features \(e\.g\., lexical richness or uncertainty markers\)\. The latter may point to more general linguistic indicators of annotation variation\. Table [1](https://arxiv.org/html/2605.06318#S2.T1) provides an overview of the key analysis differences between prior work and our work\.

## 3 Data

We use four English harmful language \(hate speech, toxic and offensive language\) datasets fulfilling the desiderata of having \(a\) unaggregated annotations \(i\.e\., individual annotators’ labeling decisions are available\), \(b\) annotator characteristics \(socio\-demographic attributes, attitudes\), and \(c\) fine\-grained annotations \(3\-/5\-point Likert\-scales\)\.

CTDP\. The corpus for toxic content classification for a diversity of perspectives collected by Kumar et al\. \([2021](https://arxiv.org/html/2605.06318#bib.bib34)\) consists of social media comments from Reddit, Twitter, and 4chan annotated for offensiveness on a 5\-point scale\. It has the annotator characteristics of gender, age, ethnicity, politics, religious importance, LGBTQ\+, and parent status\.

MHS\. The Measuring Hate Speech Corpus \(Sachdeva et al\., [2022](https://arxiv.org/html/2605.06318#bib.bib81)\) consists of social media comments annotated for hate speech on a 3\-point scale\. It provides the annotator characteristics race, gender, sexuality, religion, education, and income, and a social media and hate speech questionnaire asking about general and item\-specific attitudes\.

POPQUORN\. The POPQUORN corpus \(Pei and Jurgens, [2023](https://arxiv.org/html/2605.06318#bib.bib79)\) contains annotations for the tasks of question\-answering, offensiveness rating, text rewriting/style transfer, and politeness rating\. We use the offensiveness rating portion, consisting of Reddit comments labeled on a 5\-point scale for offensiveness\. It provides the annotator characteristics of gender, age, ethnicity, politics, occupation, and education\.

D3CODE\. The D3CODE dataset \(Davani et al\., [2024](https://arxiv.org/html/2605.06318#bib.bib67)\) contains online comments annotated for toxicity on a 5\-point scale\. It provides the annotator characteristics of gender, age, country of residence, geo\-cultural region, moral foundations, and perceived socio\-economic status\.

## 4 Methods

Annotator Characteristics\. We use all annotator characteristics in the respective datasets\.

Linguistic Features\. To assess item\-side factors of annotation variation, we use general linguistic features and domain\-specific lexical signals\. We extract linguistic features using elfen \(Maurer, [2026](https://arxiv.org/html/2605.06318#bib.bib56)\)\. Overall, we consider 327 features from the 11 provided feature areas\. We provide a full overview of the used features in Appendix [A](https://arxiv.org/html/2605.06318#A1)\.

To find domain\-specific lexical signals such as vulgarity or hateful slurs, we use the English portions of the Harassment Corpus \(Rezvan et al\., [2018](https://arxiv.org/html/2605.06318#bib.bib33)\), which contains harassment\-related words in six categories \(sexual, appearance\-related, intellectual, political, racial, and combined\); Hurtlex \(Bassignana et al\., [2018](https://arxiv.org/html/2605.06318#bib.bib32)\), which contains words to hurt in three coarse\-grained categories \(Negative stereotypes, hate words and slurs beyond stereotypes, and Other words and insults\) and 17 fine\-grained categories; Wiegand et al\. \([2018](https://arxiv.org/html/2605.06318#bib.bib85)\)’s base lexicon of abusive words; and Hatebase \[Footnote 3: [https://hatebase\.org/](https://hatebase.org/)\]\. We extract the number of offensive words from the four hate speech lexicons and the number of words from fine\-grained categories per item\. We list all domain\-specific lexicon\-based features in Appendix [B](https://arxiv.org/html/2605.06318#A2)\.

Preprocessing\. We filter the datasets to include only items that were annotated by at least three annotators who, in turn, each annotated at least 10 items\. The linguistic features are token\-normalized if they are occurrence\-count features \(e\.g\., number of nouns\)\. Then, all linguistic features are standardized to have a mean of 0 and a standard deviation of 1\. For the annotator characteristics, we remove all missing and "prefer not to answer" annotators, and all annotator IDs potentially mapping to multiple people \(i\.e\., a single annotator ID for multiple conflicting characteristics\)\. We harmonize gender and education across datasets, and re\-code multiple\-choice combinations of race to keep the number of categories manageable\. Finally, we remove annotators with multiple assignments for sexuality and religion\. We provide a more detailed description of the preprocessing choices in Appendix [C](https://arxiv.org/html/2605.06318#A3)\. Table [2](https://arxiv.org/html/2605.06318#S4.T2) shows the number of items, annotations, and annotators per item for each dataset after preprocessing\.
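The filtering and scaling steps can be sketched as follows. This is a minimal illustration under assumptions, not the released analysis code: the function names are hypothetical, and the filter is applied in a single pass (annotators first, then items), which is one possible reading of the description above.

```python
from collections import Counter
from statistics import mean, pstdev


def filter_annotations(annotations, min_annotators=3, min_items=10):
    """Filter (annotator_id, item_id, label) tuples.

    Assumption: one pass, annotators first, then items.
    """
    # Keep annotators who labeled at least `min_items` distinct items.
    pairs = {(a, i) for a, i, _ in annotations}
    items_per_annotator = Counter(a for a, _ in pairs)
    kept = [r for r in annotations if items_per_annotator[r[0]] >= min_items]

    # Among the rest, keep items labeled by at least `min_annotators` annotators.
    kept_pairs = {(a, i) for a, i, _ in kept}
    annotators_per_item = Counter(i for _, i in kept_pairs)
    return [r for r in kept if annotators_per_item[r[1]] >= min_annotators]


def token_normalize(count, n_tokens):
    """Divide an occurrence count by the item's token count."""
    return count / n_tokens if n_tokens else 0.0


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]
```

Whether the filter should be repeated until no further rows are dropped is not specified in this section; the sketch does not iterate.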

Table 2: Dataset size, number of annotations, total number of annotators, and mean number of annotators per item ± std\. deviation per dataset after preprocessing\.

Linguistic Feature Pre\-Selection\. We use a multi\-step semi\-automatic feature selection method for the linguistic features to reduce multicollinearity and the number of features while retaining interpretability\. This process allows us to remove theoretically equivalent or empirically similar features while keeping an expressive set of features\.

Firstly, we pre\-select features by calculating pairwise Pearson correlations between all the features and retaining those whose correlation with every other feature stays below a threshold of $r < 0.5$\.
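A minimal sketch of this first step is shown below. It assumes that the threshold is applied to the absolute value of the correlation, which the text does not state explicitly; the function names and the feature names in the usage example are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def split_by_correlation(features, threshold=0.5):
    """Split features into (retained, correlated).

    features: dict mapping feature name -> list of values per item.
    A feature is retained if |r| < threshold with every other feature
    (assumption: absolute correlation is what is thresholded).
    """
    names = list(features)
    retained, correlated = [], []
    for name in names:
        others = (
            abs(pearson(features[name], features[other]))
            for other in names
            if other != name
        )
        if all(r < threshold for r in others):
            retained.append(name)
        else:
            correlated.append(name)
    return retained, correlated


# Usage with made-up values: two perfectly correlated features, one unrelated.
toy = {
    "n_tokens": [1, 2, 3, 4, 5],
    "n_chars": [2, 4, 6, 8, 10],
    "n_hedges": [2, -1, 0, -1, 2],
}
retained, correlated = split_by_correlation(toy)
```

The `correlated` list is what would be passed on to the clustering step described next.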

Secondly, the remaining features that correlate with at least one other feature higher than the threshold are then clustered using single\-linkage multilevel clustering using the remaining features’ correlation matrix as a similarity matrix\. We inspect the resulting clusters for consistency with theoretical expectations \(i\.e\., that features measuring similar properties end up in the same clusters\) and pick one feature per cluster\. \[Footnote 4: In practice, we pick a feature well\-used in the literature or the one most intuitively interpretable\. We explain this in more detail using the example of POPQUORN in Appendix [G](https://arxiv.org/html/2605.06318#A7)\.\]
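As an illustration of how single linkage groups features, the sketch below cuts the clustering at one fixed similarity level. At a fixed cut, single\-linkage clusters coincide with the connected components of the graph that links every pair of features whose similarity reaches the cut. The cut level (reusing 0.5), the use of absolute correlations, and all correlation values are assumptions made for this example; the paper's exact procedure is described in its appendix, not here.

```python
def single_linkage_clusters(names, corr, threshold=0.5):
    """Group features linked by |r| >= threshold (single linkage at one cut).

    corr maps frozenset({a, b}) to the Pearson correlation of a and b;
    missing pairs are treated as uncorrelated.
    """
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path compression.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(corr.get(frozenset((a, b)), 0.0)) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Usage with made-up correlations (feature names partly hypothetical).
names = ["n_tokens", "n_chars", "n_long_words", "n_nouns", "n_verbs"]
corr = {
    frozenset(("n_tokens", "n_chars")): 0.9,
    frozenset(("n_chars", "n_long_words")): 0.6,
    frozenset(("n_tokens", "n_long_words")): 0.3,
    frozenset(("n_nouns", "n_verbs")): 0.7,
}
clusters = single_linkage_clusters(names, corr)
```

Note the chaining behavior typical of single linkage: `n_tokens` and `n_long_words` end up in the same cluster through `n_chars`, although their direct correlation is below the cut. This is one reason manual inspection of the clusters, as described above, is useful.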

Regression Modeling\. We use Bayesian multilevel regression with regularization priors because it allows for selecting features among a large number of inter\-correlated features\. We place Horseshoe priors \(Piironen and Vehtari, [2017](https://arxiv.org/html/2605.06318#bib.bib53); Carvalho et al\., [2009](https://arxiv.org/html/2605.06318#bib.bib52)\) on all fixed effects regression coefficients\. This has the effect that small and uncertain effects are aggressively pushed towards 0 while larger and more certain effects escape this shrinkage \(Piironen and Vehtari, [2017](https://arxiv.org/html/2605.06318#bib.bib53); van Erp et al\., [2019](https://arxiv.org/html/2605.06318#bib.bib55)\)\.
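For reference, the horseshoe prior in its original form (Carvalho et al., 2009) can be written as below. The notation is introduced here for illustration. This section does not state whether the analyses use this form or the regularized variant of Piironen and Vehtari (2017), nor which hyperparameters were chosen, so this should be read as the generic textbook formulation.

$$
\beta_j \mid \lambda_j, \tau \sim \mathcal{N}\left(0,\ \lambda_j^2 \tau^2\right), \qquad
\lambda_j \sim \mathrm{C}^{+}(0, 1), \qquad j = 1, \dots, p
$$

Here $\beta_j$ is the $j$-th fixed-effect coefficient, $\tau$ is a global scale shared by all coefficients (given its own prior), $\lambda_j$ is a coefficient-specific local scale, and $\mathrm{C}^{+}$ is the half-Cauchy distribution. A small $\tau$ pulls all coefficients towards 0, while the heavy tail of the half-Cauchy lets individual $\lambda_j$ grow large enough for strong effects to escape the shrinkage.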

Annotation behavior, measured by individual annotators’ labeling decisions on the respective ordinal scale, is modeled as a function of the main effects of the linguistic and annotator features, the interactions among the annotator features, and the interactions between linguistic and annotator features\. To incorporate the partially cross\-classified data structure, random intercepts for the items and annotators are included\. We treat the outcome of interest as continuous and model it with a Gaussian likelihood and an identity link function\.
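One way to write down the model described in this paragraph is given below. All symbols are notation assumed for this sketch rather than taken from the paper: $y_{ij}$ is the label annotator $j$ gives item $i$, $\mathbf{x}_i$ holds the standardized linguistic features of item $i$, and $\mathbf{a}_j$ holds the (contrast-coded) characteristics of annotator $j$.

$$
y_{ij} = \beta_0
+ \sum_{m} \beta^{\text{ling}}_{m}\, x_{im}
+ \sum_{k} \beta^{\text{ann}}_{k}\, a_{jk}
+ \sum_{k < l} \beta^{\text{ann}\times\text{ann}}_{kl}\, a_{jk}\, a_{jl}
+ \sum_{k, m} \beta^{\text{ann}\times\text{ling}}_{km}\, a_{jk}\, x_{im}
+ u_i + v_j + \varepsilon_{ij}
$$

$$
u_i \sim \mathcal{N}(0, \sigma_u^2), \qquad
v_j \sim \mathcal{N}(0, \sigma_v^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

The terms $u_i$ and $v_j$ are the random intercepts for items and annotators. Because they are crossed rather than nested, each annotation receives one of each.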

Even though the horseshoe prior aggressively shrinks small and uncertain effects toward 0, the estimates and their credible intervals are not exactly 0\. Therefore, we consider those effects that have 90% posterior credible intervals not overlapping with 0 as the *surviving* effects\.
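The survival criterion can be sketched on posterior draws as follows. This is an illustration under assumptions: it uses an equal\-tailed interval with linear interpolation between draws, while the interval type used for the 90% criterion is not specified in this section; the function names and the toy draws are hypothetical.

```python
def credible_interval(draws, level=0.90):
    """Equal-tailed credible interval from posterior draws."""
    ordered = sorted(draws)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    tail = (1 - level) / 2
    return quantile(tail), quantile(1 - tail)


def surviving_effects(posterior, level=0.90):
    """Return effects whose credible interval excludes 0.

    posterior: dict mapping effect name -> list of posterior draws.
    """
    survivors = {}
    for name, draws in posterior.items():
        lower, upper = credible_interval(draws, level)
        if lower > 0 or upper < 0:
            survivors[name] = (lower, upper)
    return survivors


# Usage with made-up draws: one effect clearly positive, one centered on 0.
toy_posterior = {
    "n_hateful": [0.2 + 0.01 * i for i in range(101)],
    "gender": [-0.5 + 0.01 * i for i in range(101)],
}
survivors = surviving_effects(toy_posterior)
```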

## 5 Comparing Datasets

For MHS \(hate speech\) and POPQUORN \(offensiveness\), we analyze the annotation behavior in the full datasets\. This allows us to compare tendencies in annotation behavior for different datasets for related tasks\. Figure [2](https://arxiv.org/html/2605.06318#S5.F2) shows the surviving effects for POPQUORN, and Figure [4](https://arxiv.org/html/2605.06318#S5.F4) for MHS\.

### 5\.1 POPQUORN

We find no surviving annotator characteristics\. \[Footnote 5: While this is in contrast to the findings of Pei and Jurgens \([2023](https://arxiv.org/html/2605.06318#bib.bib79)\), this may be a direct consequence of adding a random intercept for the annotators\. For a more in\-depth analysis regarding this, see Appendix [K](https://arxiv.org/html/2605.06318#A11)\.\] Most surviving effects are linguistic features \(5/7\)\. The linguistic survivors we find can be grouped into three components of variation in harmful language annotation: Firstly, explicit and phenomenon\-characteristic lexical cues such as negative sentiment \(n\_negative\_sentiment\) and the presence of hateful, offensive, or vulgar tokens \(n\_hateful\)\. These results are relevant from a harmful language research perspective, as they confirm assumptions about certain cues and affective dimensions of such language\.

Secondly, topical and world knowledge, as indicated by named entities \(n\_norp, i\.e\., nationalities, religious or political groups\)\. Inspection of instances with a high number of such entities reveals that they often relate to controversial topics \(we found instances mentioning Palestinians and Israel\) and frequently target ethnic and religious groups \(e\.g\., Jews, Muslims\)\. On the one hand, such findings are relevant from a human label variation perspective, as they directly tie tendencies in annotations to broader discussions in society\. On the other hand, they urge caution from a modeling/content moderation perspective: models may pick up on entity cues, but a functioning content moderation system should not be sensitive to all texts mentioning Muslims or Jews, but rather flag certain topics for human review\.

Thirdly, pragmatic and discourse phenomena such as irony may counterintuitively be related to lexical cues one usually associates with harmful language\. For instance, a high number of tokens related to moral or behavioral defects \(n\_dmc\) is associated with lower offensiveness annotations\. Inspection reveals that items with such tokens often are about the author’s opposing views on certain positions on moral grounds or are ironic\. \[Footnote 6: See examples in Appendix [F](https://arxiv.org/html/2605.06318#A6)\.\]

We find two surviving interactions between annotator characteristics and linguistic features\. The first indicates differences in age groups for lexical cues \(age\.L:n\_hateful, see Figure [3](https://arxiv.org/html/2605.06318#S5.F3)\)\. \[Footnote 7: Other interaction plots for effects discussed in this paper can be found in Appendix [O](https://arxiv.org/html/2605.06318#A15)\.\] \[Footnote 8: For ordinal predictors, \.L refers to linear and \.Q to quadratic effects\.\] While the presence of hateful words does not correlate with stronger differences in annotation choices at younger ages, the older the annotators get, the more the presence of such cues influences annotation choices\. Not only does this finding offer an interpretation for potential variation between annotator groups, it also raises questions about the reception of implicit hate speech \(i\.e\., not expressed on the surface\) for the different groups\. The second surviving interaction shows gender differences for world knowledge \(gender:n\_person; n\_person is the number of proper names such as Trump\)\.

![Refer to caption](https://arxiv.org/html/2605.06318v1/x2.png)

Figure 2: Posterior estimates for the surviving effects for POPQUORN\. The dots represent the median posterior estimates, and horizontal bars represent the 95% highest density interval\.

![Refer to caption](https://arxiv.org/html/2605.06318v1/x3.png)

Figure 3: Model predictions for the interaction age:n\_hateful \(POPQUORN\)\. Labels \(1, 0, \-1\) refer to standard deviations from the mean \(0\) of n\_hateful\. The dots represent the mean posterior estimates, and vertical bars represent the 95% highest density interval\.

### 5\.2 MHS

As in POPQUORN, we find no surviving annotator characteristics\. Again, most of the surviving effects are linguistic features \(11/15\)\. Among them, we find the same high\-level patterns as in POPQUORN\. Lexical cues show task\-specific tendencies: ethnic slurs \(n\_ps\) and words related to female genitalia \(n\_asf\)\. Similarly, inspection of items with more tokens with a high auditory \(n\_high\_auditory\) and olfactory grounding \(n\_high\_olfactory\) reveals these features to capture conventionalized vulgar expressions like trash, shit, or fuck, rather than a literal smell or sound relation of the text\. On the pragmatic/discourse level, we find surviving features related to complexity and specificity \(avg\_synsets\_noun, n\_polysyllables\), and stereotypical or derogatory descriptions of people such as their culture or the Black Guy, indicated by a relatively high number of determiners \(n\_det\)\.

We find a strong intersectional effect of ideology and age \(Fig\. [15](https://arxiv.org/html/2605.06318#A15.F15)\): towards the extremes of the ideology scale, the difference between age groups in terms of offensiveness ratings becomes starker, with flipped effects \(more conservative, older people rate lower on average, and the opposite holds for liberals\)\. This is relevant from a label variation, a harmful language, and a content moderation perspective, as it shows that even when individual sociodemographic proxies do not indicate systematic differences, their interaction may, pointing to the impact of the lived experiences of certain subgroups\.

We also find three interactions between linguistic features and annotator characteristics\. This, again, reveals task\-specific differences: the association between ratings and lexical cues, such as emotion intensity for surprise, is dependent on ideology \(ideology\.C:n\_high\_surprise; tokens with high surprise intensity include examples like shockingly or disaster\), and the association between ratings and number of hedges \(tokens indicating speaker uncertainty\) depends on education level \(education\.Q:n\_hedges\)\. Finally, we find an interaction of Hindu annotators and the number of tokens not assignable to a standard POS tag \(religionhindu:n\_x\)\. Inspection of items and annotations points to an artifact of data and annotator sampling: Coincidentally, items with a high number of tokens with the X tag \(non\-standard words/slang, misspellings, non\-Latin characters\) had more Hindu annotators, who tended to rate these items as not hateful\. From a modeling perspective, such an effect can be viewed as a concrete example of data quality issues that such an analysis can reveal\. It is reasonable to assume that language models may pick up on such spurious signals, and thus, one may want to be aware of them in one’s data\.

![Refer to caption](https://arxiv.org/html/2605.06318v1/x4.png)

Figure 4: Posterior estimates for the surviving effects for MHS\. The dots represent the median posterior estimates, and horizontal bars represent the 95% highest density interval\.

### 5\.3 Discussion

While we only find one common effect, the effect of negative sentiment tokens, this is expected given different tasks, annotation guidelines, items, and annotators\. Interestingly, however, both models show effects given the presence of offensive, hateful, and vulgar words, validating the informativeness of such lexical signals found by Rizzi et al\. \([2025](https://arxiv.org/html/2605.06318#bib.bib80)\) for related tasks\. Overall, the findings point to dataset, annotator set, and task\-specific tendencies, urging caution when applying findings of one specific configuration to another\. Finally, the surviving interaction effects of annotator characteristics and linguistic features underline the importance of taking the variation between annotators and items, and their interdependences into account\.

## 6 Simulating New Annotators

Given the large number of annotators per item, D3CODE is well\-suited to assess whether tendencies in annotation behavior persist when we collect more annotations on the same items from annotators with similar distributions of annotator characteristics to the original annotator population\. We randomly sample half of the annotators\. Our halved samples retain similar distributions of annotator characteristics and numbers of annotations per item\. Per half, we fit one model\.
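The split itself can be sketched in a few lines. The function name, the seed, and the absence of stratification are assumptions for this example; the text only states that half of the annotators are sampled at random and that the halves turn out to have similar distributions.

```python
import random


def split_annotators(annotator_ids, seed=0):
    """Randomly split the unique annotator IDs into two disjoint halves."""
    ids = sorted(set(annotator_ids))  # sort first so the seed is reproducible
    rng = random.Random(seed)
    rng.shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])


# Usage with hypothetical integer annotator IDs.
first_half, second_half = split_annotators(range(100))
```

After such a split, one would compare the distributions of annotator characteristics and annotations per item across both halves before fitting one model per half, as the paragraph above describes.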

### 6\.1 Results

Figure [5](https://arxiv.org/html/2605.06318#S6.F5) shows the posterior estimates for the surviving effects for both sets of annotators for D3CODE\. \[Footnote 9: Note that the effect sizes may not be strictly comparable, since moral foundations such as care are not standardized, while linguistic features are\.\] By itself, one annotator characteristic survives in both models: the moral foundation dimension care\. The moral foundations questionnaire differs from other annotator characteristics insofar as it surveys moral intuitions from people rather than measuring innate characteristics such as age or gender\. Care, specifically, scores high for annotators who consider protecting vulnerable individuals from emotional and physical harm as very important, and is associated with empathy and compassion \(Graham et al\., [2013](https://arxiv.org/html/2605.06318#bib.bib62)\)\. For both models, we find no surviving linguistic features\.

We find four intersectional effects: for annotators from Egypt with age and self\-reported socio\-economic status \(SES\), and for annotators from China with SES and the moral foundation of equality, which scores high for annotators who believe that all people should be treated equally\.

Across both models, roughly 40% of the surviving effects are interactions between lexical cue features and annotator characteristics\. We take this as a strong reason to take into account the complex relationships between personal identity, lived experiences, and the perceptions of certain texts\. For example, women may rate items with slurs related to prostitution higher because these slurs are gendered and target women, as genderwoman:n\_pr hints at\. This makes a case for considering label variation as a signal rather than noise: if women differ systematically for certain items, this should not get flattened away by label aggregation\. Similarly, from a content moderation/modeling perspective, such perceptions of targeted groups may be the focus of interest \(Fleisig et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib68)\)\.

Interactions of the annotator characteristics moral foundations, SES, or country with several linguistic features related to length and complexity \(e\.g\., care:n\_long\_words and Egypt:n\_tokens\) point to a complex relationship of items and annotators\. On the one hand, from a label variation and harmful language research perspective, they may point to specific items that are particularly prone to varied perceptions, for instance, longer texts\. On the other hand, if such interactions are only present for certain samples of annotators, but not necessarily for the general population \(e\.g\., Egypt:n\_abusive is only a survivor for the first half of the annotators\), this may be problematic from a modeling and content moderation perspective: a model working well on a "representative" sample of the target annotator population may still not be well\-equipped for the patterns of another sample, or the population overall\. This raises questions about how to model within\-group variation, and which learned patterns to consider generalizable\.

![Refer to caption](https://arxiv.org/html/2605.06318v1/x5.png)

Figure 5: Posterior estimates for the surviving effects for the two halves of the annotators of D3CODE\.

### 6\.2 Discussion

The results show that roughly half of the surviving effects can be found for both models\. This indicates that, given the same text items, similar distributions of annotator characteristics, and the same guidelines, we can expect some level of aligned labeling behavior\. The other half of the surviving effects, however, points to a certain level of intra\-group variation\. Overall, our results indicate that label variation may, to a large extent, be driven by individual differences for certain items rather than by annotator characteristics or certain items alone\. This is relevant from multiple angles: from a label variation perspective, this underlines the necessity to look beyond individual proxies and understand annotation as an interactive process between annotators and items to disentangle variation\. From a modeling and content moderation perspective, this raises the question of who labeled the items we are training on, and of their linguistic tendencies\.
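
One way to quantify how many surviving effects two models share is a simple set comparison. The sketch below is illustrative only: the effect names are placeholders rather than results from the paper, and measuring the shared fraction relative to the union of survivors is an assumption about how "roughly half" could be operationalized.

```python
def survivor_overlap(survivors_a, survivors_b):
    """Compare two collections of surviving effect names."""
    a, b = set(survivors_a), set(survivors_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Fraction of all surviving effects that appear in both models.
        "shared_share": len(shared) / len(union) if union else 0.0,
    }


# Placeholder effect names (hypothetical, not the paper's survivors).
half_1 = ["x1", "x2", "x1:z1", "x2:z3"]
half_2 = ["x1", "x1:z1", "x3", "x2:z2"]
result = survivor_overlap(half_1, half_2)
```

Effects listed under `only_a` or `only_b` are the ones that would point to variation between the two annotator samples.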

## 7 Simulating New Annotators and Items

![Refer to caption](https://arxiv.org/html/2605.06318v1/x6.png)

Figure 6: Posterior estimates for surviving effects of CTDP\. We show effects that survive for $\geq 2$ subsets\.

Given the high number of items and annotators, CTDP is well suited to simulating a batched annotation scenario to assess whether annotation behavior tendencies hold across batches, i\.e\., on previously unannotated items with completely new annotators\.

We reconstruct the batches by dividing the data into subsets of items that share the same $k$ annotator IDs\. We divide CTDP into subsets of 300 each\. We run one model on each of 3 randomly drawn subsets, totaling $\sim 16\%$ of the full dataset\.
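
The batch reconstruction described above amounts to grouping items by the exact set of annotators who labeled them. The following is a minimal sketch under that reading; the function names, the `(item_id, annotator_id)` input format, and the toy data are assumptions made for illustration and are not taken from the authors' code.

```python
import random
from collections import defaultdict


def reconstruct_batches(annotations):
    """Group items by the exact set of annotators who labeled them.

    `annotations` is an iterable of (item_id, annotator_id) pairs.
    Returns a dict mapping a frozenset of annotator IDs to sorted item IDs.
    """
    annotators_per_item = defaultdict(set)
    for item_id, annotator_id in annotations:
        annotators_per_item[item_id].add(annotator_id)

    batches = defaultdict(list)
    for item_id, annotator_set in annotators_per_item.items():
        batches[frozenset(annotator_set)].append(item_id)
    return {key: sorted(items) for key, items in batches.items()}


def sample_batches(batches, n, seed=0):
    """Draw `n` batch keys at random, reproducibly."""
    rng = random.Random(seed)
    keys = sorted(batches, key=sorted)  # deterministic order before sampling
    return rng.sample(keys, n)


# Toy annotation pairs (hypothetical IDs).
annotations = [
    ("i1", "a1"), ("i1", "a2"),
    ("i2", "a1"), ("i2", "a2"),
    ("i3", "a3"), ("i3", "a4"),
]
batches = reconstruct_batches(annotations)
drawn = sample_batches(batches, 1)
```

Because each batch is keyed by its annotator set, any two drawn batches with different keys differ in at least one annotator; checking that the keys are fully disjoint would be an additional step if completely new annotators per batch are required.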

### 7\.1 Results

Figure [6](https://arxiv.org/html/2605.06318#S7.F6) shows effects that survive for at least two splits\. Full results are presented in §[E](https://arxiv.org/html/2605.06318#A5)\. The only two common surviving annotator characteristics across all three splits are annotator attitudes towards specific items\. The more an annotator considers a given item fine to see, the lower the toxicity annotation \(fine\_to\_see\.L\)\. Conversely, the more an annotator thinks a post should be removed, the higher the toxicity annotation \(remove\.L; footnote 10: the effects are negative here because the scale is flipped; "This comment should be removed" is the lowest level, while "This comment should be allowed" is the highest\)\. These annotator characteristics differ from sociodemographic proxies or general attitudes and moral foundations insofar as they measure item\-specific perceptions and attitudes\. Conceptually, they are thus similar to item feature\-annotator characteristics interactions\.
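
The sign flip noted for remove.L can be illustrated with a small sketch. It assumes that the `.L` suffix denotes the linear term of an orthonormal polynomial contrast for an ordered factor with equally spaced levels, which the text does not confirm; the function name and the three-level scale are hypothetical.

```python
import math


def linear_contrast(n_levels):
    """Orthonormal linear contrast scores for equally spaced ordered levels."""
    mean = (n_levels - 1) / 2
    centered = [level - mean for level in range(n_levels)]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


# Hypothetical three-level scale, lowest level first.
scores = linear_contrast(3)
# The same levels coded in the opposite direction.
flipped = list(reversed(scores))
```

Under this coding the lowest level receives a negative score and the highest a positive one, so reversing the order of the levels negates every score and therefore the sign of the fitted linear coefficient, while the described relationship stays the same.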

We find no linguistic features among the surviving effects in $\geq 2$ splits\. All six of the surviving interactions include two annotator characteristics\. Two of the interactions reveal differences between groups along axes of general attitudes with item\-specific attitudes \(toxic\_comments\_problem, i\.e\., whether annotators think that toxic comments online are a problem, interacts with fine\_to\_see, and religion\_importance interacts with remove\)\. Furthermore, two reflect differences in lived realities and experiences, as shown by the interaction of whether an annotator has been the target of toxic comments and to what extent they think technology has a positive impact on society \(technology\_impact\.L:been\_targetTrue\), and whether they personally use online forums and to what extent they think a given item should be removed \(uses\_forumsTRUE:remove\.L\)\.
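
Selecting the effects that survive in at least two splits is a counting step over the per-split survivor lists. A minimal sketch, assuming each split yields a list of effect names: the function name is hypothetical, and apart from fine_to_see.L and remove.L, which the text reports for all three splits, the names below are placeholders.

```python
from collections import Counter


def effects_in_at_least(survivors_per_split, min_splits=2):
    """Return effects that survive in at least `min_splits` splits, with counts."""
    counts = Counter()
    for survivors in survivors_per_split:
        counts.update(set(survivors))  # count each effect once per split
    return {effect: n for effect, n in counts.items() if n >= min_splits}


# Toy survivor lists for three splits (placeholder names except the two attitudes).
splits = [
    ["fine_to_see.L", "remove.L", "x1"],
    ["fine_to_see.L", "remove.L", "x2:z1"],
    ["fine_to_see.L", "remove.L", "x2:z1", "x3"],
]
common = effects_in_at_least(splits, min_splits=2)
```

Raising `min_splits` to the number of splits returns only the effects common to every model.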

### 7\.2 Discussion

While there are no linguistic features or interactions with them among the surviving effects common to at least two models, the two main surviving effects conceptually share some overlap\. In contrast to sociodemographic variables like gender or general attitudes, they fundamentally reflect item\-specific attitudes of annotators\. Overall, the survivors point to the utility of phenomenon\- and item\-specific attributes reflecting the lived experiences and interactive perceptions in the annotation\. Such attributes, as the results of this analysis in comparison with the two previous analyses indicate, may account for more variance in annotation behavior than broad sociodemographic features or even linguistic\-sociodemographic interactions\. This setup particularly lends itself to content moderation, as the phenomenon and item\-specific annotator attitudes allow for capturing nuanced perceptions of when and why a given person may rate an item as toxic\.

## 8 Conclusion

In this work, we presented a series of analyses on four unaggregated harmful language datasets\. Using multilevel Bayesian models with a rich set of linguistic features, annotator characteristics, and their interactions, we found that, while there were differences across datasets, interactions of item features and annotator characteristics \(Sections [5](https://arxiv.org/html/2605.06318#S5) & [6](https://arxiv.org/html/2605.06318#S6)\) or item\-specific attitude effects \(Section [7](https://arxiv.org/html/2605.06318#S7)\) were present across analyses\. This has consequences for data collection efforts: Firstly, some items may need more annotations by a more diverse or focused set of annotators, while others may be reasonably uncontroversial across annotator groups\. For example, for items containing slurs targeted at women, we may want to make sure to get them annotated by many women of different backgrounds\. Secondly, our results point to the need for reflection on which annotator characteristics and attitudes should be collected, particularly on an item level\. This depends on the purpose a given dataset is collected for, and which questions it is supposed to answer\. Finally, since data defines all parts of the NLP system lifecycle, our findings call for critical engagement with the assumptions of what annotated data aims to capture and their links to model behavior, both in training and evaluation scenarios\.

## Limitations

While our work assesses annotation variation in a principled manner, taking into account the structure of annotations and predictors on two levels, and their interactions, our work is limited in multiple regards\.

Firstly, we only assess one set of tasks, offensive/hateful language detection, on one language, English\. While our findings do not claim any universality, we still urge caution, given that findings may not transfer across languages, tasks, and domains\.

Secondly, our assessment is limited to available annotator and item characteristics\. There may be arbitrarily more, some of which may not be practically and reliably measurable \(e\.g\., annotator mood, or short distractions\)\. In a similar vein, our work does not account for batch\-level effects \(footnote 11: we note that this limitation directly arises from how data is documented: none of the datasets we assess includes batch IDs, and they may not always be easily reconstructable\), intra\-annotator agreement \(Abercrombie et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib63)\), the impact of the annotation setup \(e\.g\., the annotation environment \(Kern et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib70)\), the order in which items are shown \(Beck et al\., [2024a](https://arxiv.org/html/2605.06318#bib.bib65)\)\), or the effect of the annotation itself\. For instance, verbally thinking about one’s assessment may impact the annotator’s own emotional/intuitive response, given that language may modulate perception and cognition \(Lupyan, [2012](https://arxiv.org/html/2605.06318#bib.bib59)\)\. Different formats may lead to different response distributions \(cf\. findings from survey methodology showing systematic differences between numbers and labelings of response options, Weijters et al\., [2010](https://arxiv.org/html/2605.06318#bib.bib58)\)\.

Moreover, annotator selection practices may impact the interpretation of findings\. While over a whole dataset, annotator socio\-demographics may be reflective of the whole target population \(e\.g\., residents of a given country, or English speakers\), this is not the case for any given batch and item \(footnote 12: we note that, likely, this is due to costs, as having a set of annotators with socio\-demographics reflective of the whole target population necessarily requires a high number of annotators per batch/item\)\. This may have a considerable impact on the estimated effects, as many batches and items will only ever be annotated by the socio\-demographic majority groups\. Especially in tasks like hate speech detection, this comes with real implications for groups targeted by hateful and offensive language \(cf\. Fleisig et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib68)\)\.

The extent of exploration is limited by our available resources and, more fundamentally, by available implementations \(footnote 13: we discuss observations on these limitations and how we handle them in Appendix [M](https://arxiv.org/html/2605.06318#A13)\)\. In our experiments, we hit implementation limits in how complex a model can be\. Given that we therefore reduce the sample size for some datasets by splitting them \(Experiment 3, Section [7](https://arxiv.org/html/2605.06318#S7)\), it is very likely that the posteriors are broader than when using the full datasets\. As such, it is possible that for the smaller\-sample scenarios, some effects do not survive that would have survived in the full\-data scenario\. Including random slopes in addition to random intercepts may be informative and can be argued for, but it is infeasible in our setup and given our resources\. Similarly, item features never occur in isolation, and interactions of uncorrelated features may reveal interesting items\. Due to Comparisons between coefficients in Experiment 2 \(Section [6](https://arxiv.org/html/2605.06318#S6)\) are not meaningful since they are not all standardized \(continuous annotator characteristics are not standardized\)\. The interpretations, however, would be limited even if they were, given that an effect of one standard deviation from the mean of the moral foundation care, for example, is hard to interpret, and even harder to compare to effects of other such variables\. Given that whether and when coefficients are strictly comparable is an ongoing discussion, we urge caution when trying to compare coefficients and deliberately refrain from it in this work\.
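
Why standardized and unstandardized coefficients sit on different scales can be seen from a short derivation. This is a generic sketch for a single predictor in a linear predictor, not the model fitted in this work; the symbols $x$, $\mu_x$, $s_x$, $\alpha$, and $\beta$ are introduced here for illustration only.

$$
z = \frac{x - \mu_x}{s_x}
\quad\Longrightarrow\quad
\alpha + \beta x = (\alpha + \beta \mu_x) + (\beta s_x)\, z
$$

The coefficient on the standardized predictor $z$ is $\beta s_x$, so a coefficient estimated on a raw scale and one estimated on a standardized scale differ by a factor that depends on the spread of the underlying variable.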

Finally, we note that our work is only one of many ways to analyze factors of variation\. And at that, it is quite a conservative approach, especially with respect to what we consider a surviving effect, and how we operationalize the exploration\. Decisions at all steps may impact findings\. We discuss decision rationales and alternatives further in Appendix [I](https://arxiv.org/html/2605.06318#A9)\.

## Ethical Considerations

The work presented in this paper belongs to the perspectivist framework, whose agenda we support fully\. We are, however, aware that the claim that multiple \(and many\) annotations per item should be collected can be problematic for research groups with limited funding – and this should not become an economic bottleneck\. We believe, however, that works like ours show that statistical analysis of existing datasets \(without collecting new data\) can provide fundamental insights\. Moreover, our methodology, bringing together the who and the what, can also inform a more efficient way of using annotation budget – one that exploits systematic patterns of interaction between annotator and item characteristics to improve dataset quality by focusing on specific sets of items\.

## Contributions

In the following, we list the contributions of each author of this paper according to the CRediT taxonomy \([https://credit\.niso\.org/](https://credit.niso.org/)\)\.

- Conceptualization: Maximilian Maurer, Maximilian Linde, Gabriella Lapesa
- Data Curation: Maximilian Maurer, Maximilian Linde
- Formal Analysis: Maximilian Maurer, Maximilian Linde
- Funding Acquisition: All authors are employed in household positions at GESIS and used associated funds\.
- Investigation: Maximilian Maurer, Maximilian Linde, Gabriella Lapesa
- Methodology: Maximilian Maurer, Maximilian Linde, Gabriella Lapesa
- Project Administration: Maximilian Maurer, Gabriella Lapesa
- Resources: All authors are employed in household positions at GESIS and used associated resources\.
- Software: Maximilian Maurer, Maximilian Linde
- Supervision: Gabriella Lapesa
- Validation: Maximilian Maurer, Maximilian Linde, Gabriella Lapesa
- Visualization: Maximilian Maurer
- Writing: Maximilian Maurer, Maximilian Linde, Gabriella Lapesa

## References

- G\. Abercrombie, D\. Hovy, and V\. Prabhakaran \(2023\) Temporal and second language influence on intra\-annotator agreement and stability in hate speech labelling\. In Proceedings of the 17th Linguistic Annotation Workshop \(LAW\-XVII\), J\. Prange and A\. Friedrich \(Eds\.\), Toronto, Canada, pp\. 96–103\. External Links: [Link](https://aclanthology.org/2023.law-1.10/), [Document](https://dx.doi.org/10.18653/v1/2023.law-1.10)\.
- J\. Anderson \(1981\) Analysing the Readability of English and Non\-English Texts in the Classroom with Lix\. Seventh Australian Reading Association Conference, pp\. 1–13\. External Links: [Link](https://eric.ed.gov/?id=ED207022)\.
- V\. Basile, M\. Fell, T\. Fornaciari, D\. Hovy, S\. Paun, B\. Plank, M\. Poesio, and A\. Uma \(2021a\) We need to consider disagreement in evaluation\. In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, K\. Church, M\. Liberman, and V\. Kordoni \(Eds\.\), Online, pp\. 15–21\. External Links: [Link](https://aclanthology.org/2021.bppf-1.3/), [Document](https://dx.doi.org/10.18653/v1/2021.bppf-1.3)\.
- V\. Basile, M\. Fell, T\. Fornaciari, D\. Hovy, S\. Paun, B\. Plank, M\. Poesio, and A\. Uma \(2021b\) We Need to Consider Disagreement in Evaluation\. In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, K\. Church, M\. Liberman, and V\. Kordoni \(Eds\.\), pp\. 15–21\. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.bppf-1.3), [Link](https://aclanthology.org/2021.bppf-1.3)\.
- E\. Bassignana, V\. Basile, V\. Patti, et al\. \(2018\) Hurtlex: a multilingual lexicon of words to hurt\. In CEUR Workshop proceedings, Vol\. 2253, pp\. 1–6\. External Links: [Link](https://ceur-ws.org/Vol-2253/paper49.pdf)\.
- J\. Beck, S\. Eckman, B\. Ma, R\. Chew, and F\. Kreuter \(2024a\) Order effects in annotation tasks: further evidence of annotation sensitivity\. In Proceedings of the 1st Workshop on Uncertainty\-Aware NLP \(UncertaiNLP 2024\), R\. Vázquez, H\. Celikkanat, D\. Ulmer, J\. Tiedemann, S\. Swayamdipta, W\. Aziz, B\. Plank, J\. Baan, and M\. de Marneffe \(Eds\.\), St Julians, Malta, pp\. 81–86\. External Links: [Link](https://aclanthology.org/2024.uncertainlp-1.8/), [Document](https://dx.doi.org/10.18653/v1/2024.uncertainlp-1.8)\.
- T\. Beck, H\. Schuff, A\. Lauscher, and I\. Gurevych \(2024b\) Sensitivity, performance, robustness: deconstructing the effect of sociodemographic prompting\. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\), Y\. Graham and M\. Purver \(Eds\.\), St\. Julian’s, Malta, pp\. 2589–2615\. External Links: [Link](https://aclanthology.org/2024.eacl-long.159/), [Document](https://dx.doi.org/10.18653/v1/2024.eacl-long.159)\.
- L\. Biester, V\. Sharma, A\. Kazemi, N\. Deng, S\. Wilson, and R\. Mihalcea \(2022\) Analyzing the effects of annotator gender across NLP tasks\. In Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022, G\. Abercrombie, V\. Basile, S\. Tonelli, V\. Rieser, and A\. Uma \(Eds\.\), Marseille, France, pp\. 10–19\. External Links: [Link](https://aclanthology.org/2022.nlperspectives-1.2/)\.
- M\. Brysbaert, P\. Mandera, S\. F\. McCormick, and E\. Keuleers \(2019\) Word prevalence norms for 62,000 English lemmas\. Behavior Research Methods 51, pp\. 467–479\. External Links: [Link](https://doi.org/10.3758/s13428-018-1077-9), [Document](https://dx.doi.org/10.3758/s13428-018-1077-9)\.
- M\. Brysbaert, A\. B\. Warriner, and V\. Kuperman \(2014\) Concreteness ratings for 40 thousand generally known English word lemmas\. Behavior Research Methods 46, pp\. 904–911\. External Links: [Link](https://doi.org/10.3758/s13428-013-0403-5), [Document](https://dx.doi.org/10.3758/s13428-013-0403-5)\.
- P\. Bürkner and M\. Vuorre \(2019\) Ordinal regression models in psychology: a tutorial\. Advances in Methods and Practices in Psychological Science 2\(1\), pp\. 77–101\. External Links: [Document](https://dx.doi.org/10.1177/2515245918823199)\.
- P\. Bürkner \(2017\) brms: an R package for Bayesian multilevel models using Stan\. Journal of Statistical Software 80\(1\), pp\. 1–28\. External Links: [Document](https://dx.doi.org/10.18637/jss.v080.i01)\.
- F\. Cabitza, A\. Campagner, and V\. Basile \(2023\) Toward a perspectivist turn in ground truthing for predictive computing\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 37:6, pp\. 6860–6868\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/25840), [Document](https://dx.doi.org/10.1609/aaai.v37i6.25840)\.
- J\. B\. Carroll \(1964\) Language and thought\. Prentice\-Hall\.
- C\. M\. Carvalho, N\. G\. Polson, and J\. G\. Scott \(2009\) Handling Sparsity via the Horseshoe\. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pp\. 73–80\. External Links: ISSN 1938\-7228\.
- M\. Coleman and T\. L\. Liau \(1975\) A computer readability formula designed for machine scoring\. Journal of Applied Psychology 60\(2\), pp\. 283\. External Links: [Link](https://doi.org/10.1037/h0076540), [Document](https://dx.doi.org/10.1037/h0076540)\.
- M\. A\. Covington and J\. D\. McFall \(2010\) Cutting the Gordian Knot: The Moving\-Average Type–Token Ratio \(MATTR\)\. Journal of Quantitative Linguistics 17\(2\), pp\. 94–100\. External Links: [Document](https://dx.doi.org/10.1080/09296171003643098), [Link](https://doi.org/10.1080/09296171003643098)\.
- A\. Davani, M\. Díaz, D\. Baker, and V\. Prabhakaran \(2024\) D3CODE: disentangling disagreements in data across cultures on offensiveness detection and evaluation\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 18511–18526\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.1029/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1029)\.
- A\. Davani, M\. Díaz, and V\. Prabhakaran \(2022\) Dealing with disagreements: looking beyond the majority vote in subjective annotations\. Transactions of the Association for Computational Linguistics 10, pp\. 92–110\. External Links: [Link](https://aclanthology.org/2022.tacl-1.6/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00449)\.
- M\. de Marneffe, C\. D\. Manning, J\. Nivre, and D\. Zeman \(2021\) Universal Dependencies\. Computational Linguistics 47\(2\), pp\. 255–308\. External Links: [Link](https://aclanthology.org/2021.cl-2.11/), [Document](https://dx.doi.org/10.1162/coli%5Fa%5F00402)\.
- V\. Diveica, P\. M\. Pexman, and R\. J\. Binney \(2023\) Quantifying social semantics: An inclusive definition of socialness and ratings for 8388 English words\. Behavior Research Methods 55\(2\), pp\. 461–473\. External Links: [Link](https://doi.org/10.3758/s13428-022-01810-x), [Document](https://dx.doi.org/10.3758/s13428-022-01810-x)\.
- D\. Dugast \(1978\) Sur quoi se fonde la notion d’etendue theoratique du vocabulaire?\. Le francais Modern 46\(1\), pp\. 25\. External Links: [Link](https://cir.nii.ac.jp/crid/1370857593755305748)\.
- E\. Fleisig, R\. Abebe, and D\. Klein \(2023\) When the majority is wrong: modeling annotator disagreement for subjective tasks\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 6715–6726\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.415/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.415)\.
- J\. Graham, J\. Haidt, S\. Koleva, M\. Motyl, R\. Iyer, S\. P\. Wojcik, and P\. H\. Ditto \(2013\) Moral foundations theory: the pragmatic validity of moral pluralism\. In Advances in Experimental Social Psychology, P\. Devine and A\. Plant \(Eds\.\), Vol\. 47, pp\. 55–130\. External Links: ISSN 0065\-2601, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/B978-0-12-407236-7.00002-4), [Link](https://www.sciencedirect.com/science/article/pii/B9780124072367000024)\.
- G\. Herdan \(1955\) A new derivation and interpretation of Yule’s ‘Characteristic’ K\. Zeitschrift für angewandte Mathematik und Physik ZAMP 6, pp\. 332–339\. External Links: [Link](https://doi.org/10.1007/BF01587632), [Document](https://dx.doi.org/10.1007/BF01587632)\.
- G\. Herdan \(1964\) Quantitative linguistics\. Butterworths\. External Links: ISBN 9780208001962, LCCN 65001571\.
- C\. Homan, G\. Serapio\-Garcia, L\. Aroyo, M\. Diaz, A\. Parrish, V\. Prabhakaran, A\. Taylor, and D\. Wang \(2024\) Intersectionality in AI safety: using multilevel models to understand diverse perceptions of safety in conversational AI\. In Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP \(NLPerspectives\) @ LREC\-COLING 2024, G\. Abercrombie, V\. Basile, D\. Bernadi, S\. Dudy, S\. Frenda, L\. Havens, and S\. Tonelli \(Eds\.\), Torino, Italia, pp\. 131–141\. External Links: [Link](https://aclanthology.org/2024.nlperspectives-1.15/)\.
- C\. Kern, S\. Eckman, J\. Beck, R\. Chew, B\. Ma, and F\. Kreuter \(2023\) Annotation sensitivity: training data collection methods affect model performance\. In Findings of the Association for Computational Linguistics: EMNLP 2023, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 14874–14886\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.992/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.992)\.
- J\. P\. Kincaid, R\. P\. Jr\. Fishburne, R\. L\. Rogers, and B\. S\. Chissom \(1975\) Derivation Of New Readability Formulas \(Automated Readability Index, Fog Count And Flesch Reading Ease Formula\) For Navy Enlisted Personnel\. Technical report Vol\. 56, Institute for Simulation and Training\. External Links: [Link](https://stars.library.ucf.edu/istlibrary/56)\.
- J\. Kocoń, A\. Figas, M\. Gruza, D\. Puchalska, T\. Kajdanowicz, and P\. Kazienko \(2021\) Offensive, aggressive, and hate speech analysis: from data\-centric to human\-centered approach\. Information Processing & Management 58\(5\), pp\. 102643\. External Links: ISSN 0306\-4573, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ipm.2021.102643), [Link](https://www.sciencedirect.com/science/article/pii/S0306457321001333)\.
- D\. Kumar, P\. G\. Kelley, S\. Consolvo, J\. Mason, E\. Bursztein, Z\. Durumeric, K\. Thomas, and M\. Bailey \(2021\) Designing toxic content classification for a diversity of perspectives\. In Seventeenth Symposium on Usable Privacy and Security \(SOUPS 2021\), pp\. 299–318\. External Links: ISBN 978\-1\-939133\-25\-0, [Link](https://www.usenix.org/conference/soups2021/presentation/kumar)\.
- V\. Kuperman, H\. Stadthagen\-Gonzalez, and M\. Brysbaert \(2012\) Age\-of\-acquisition ratings for 30,000 English words\. Behavior Research Methods 44, pp\. 978–990\. External Links: [Link](https://doi.org/10.3758/s13428-012-0210-4), [Document](https://dx.doi.org/10.3758/s13428-012-0210-4)\.
- S\. Larimore, I\. Kennedy, B\. Haskett, and A\. Arseniev\-Koehler \(2021\) Reconsidering annotator disagreement about racist language: noise or signal?\. In Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media, L\. Ku and C\. Li \(Eds\.\), Online, pp\. 81–90\. External Links: [Link](https://aclanthology.org/2021.socialnlp-1.7/), [Document](https://dx.doi.org/10.18653/v1/2021.socialnlp-1.7)\.
- N\. Lee, C\. Jung, J\. Myung, J\. Jin, J\. Camacho\-Collados, J\. Kim, and A\. Oh \(2024\) Exploring cross\-cultural differences in English hate speech annotations: from dataset construction to analysis\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\), Mexico City, Mexico, pp\. 4205–4224\. External Links: [Link](https://aclanthology.org/2024.naacl-long.236/), [Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.236)\.
- G\. Lupyan \(2012\)Linguistically modulated perception and cognition: the label\-feedback hypothesis\.Frontiers in psychology3,pp\. 54\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.3389/fpsyg.2012.00054)Cited by:[Limitations](https://arxiv.org/html/2605.06318#Sx1.p3.1)\.
- D\. Lynott, L\. Connell, M\. Brysbaert, J\. Brand, and J\. Carney \(2020\)The Lancaster Sensorimotor Norms: multidimensional measures of perceptual and action strength for 40,000 English words\.Behavior Research Methods52,pp\. 1271–1291\.External Links:[Link](https://doi.org/10.3758/s13428-019-01316-z),[Document](https://dx.doi.org/10.3758/s13428-019-01316-z)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px3.p1.1)\.
- H\. Mass \(1972\)Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes\.Zeitschrift für Literaturwissenschaft und Linguistik2\(8\),pp\. 73\.Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px5.p1.3)\.
- M\. Maurer \(2026\)Elfen: a python package for efficient linguistic feature extraction for natural language datasets\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),D\. Croce, J\. Leidner, and N\. S\. Moosavi \(Eds\.\),Rabat, Marocco,pp\. 61–74\.External Links:[Link](https://aclanthology.org/2026.eacl-demo.5/),[Document](https://dx.doi.org/10.18653/v1/2026.eacl-demo.5),ISBN 979\-8\-89176\-382\-1Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.p1.1),[§4](https://arxiv.org/html/2605.06318#S4.p2.1)\.
- G\. H\. Mc Laughlin \(1969\)SMOG grading\-a new readability formula\.Journal of reading12\(8\),pp\. 639–646\.Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px2.p1.1)\.
- P\. M\. McCarthy and S\. Jarvis \(2007\)Vocd: a theoretical and empirical evaluation\.Language Testing24\(4\),pp\. 459–488\.External Links:[Document](https://dx.doi.org/10.1177/0265532207080767),[Link](https://doi.org/10.1177/0265532207080767),https://doi\.org/10\.1177/0265532207080767Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px5.p1.3)\.
- P\. M\. McCarthy and S\. Jarvis \(2010\)MTLD, vocd\-D, and HD\-D: A validation study of sophisticated approaches to lexical diversity assessment\.Behavior Research Methods42\(2\),pp\. 381–392\.External Links:[Link](https://doi.org/10.3758/BRM.42.2.381),[Document](https://dx.doi.org/10.3758/BRM.42.2.381)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px5.p1.3)\.
- S\. M\. Mohammad and P\. D\. Turney \(2013\)Crowdsourcing a word–emotion association lexicon\.Computational Intelligence29\(3\),pp\. 436–465\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1111/j.1467-8640.2012.00460.x),[Link](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8640.2012.00460.x)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px11.p1.1)\.
- S\. Mohammad and P\. Turney \(2010\)Emotions evoked by common words and phrases: using Mechanical Turk to create an emotion lexicon\.InProceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text,D\. Inkpen and C\. Strapparava \(Eds\.\),Los Angeles, CA,pp\. 26–34\.External Links:[Link](https://aclanthology.org/W10-0204/)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px11.p1.1)\.
- S\. Mohammad \(2018a\)Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),I\. Gurevych and Y\. Miyao \(Eds\.\),Melbourne, Australia,pp\. 174–184\.External Links:[Link](https://aclanthology.org/P18-1017/),[Document](https://dx.doi.org/10.18653/v1/P18-1017)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px11.p1.1)\.
- S\. Mohammad \(2018b\)Word affect intensities\.InProceedings of the Eleventh International Conference on Language Resources and Evaluation \(LREC 2018\),N\. Calzolari, K\. Choukri, C\. Cieri, T\. Declerck, S\. Goggi, K\. Hasida, H\. Isahara, B\. Maegaard, J\. Mariani, H\. Mazo, A\. Moreno, J\. Odijk, S\. Piperidis, and T\. Tokunaga \(Eds\.\),Miyazaki, Japan\.External Links:[Link](https://aclanthology.org/L18-1027/)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px11.p1.1)\.
- OECD, Eurostat, and UNESCO Institute for Statistics \(2015\)ISCED 2011 operational manual: guidelines for classifying national education programmes and related qualifications\.OECD Publishing,Paris\.External Links:[Document](https://dx.doi.org/10.1787/9789264228368-en),[Link](https://doi.org/10.1787/9789264228368-en)Cited by:[Appendix C](https://arxiv.org/html/2605.06318#A3.SS0.SSS0.Px2.p2.1)\.
- M\. Orlikowski, J\. Pei, P\. Röttger, P\. Cimiano, D\. Jurgens, and D\. Hovy \(2025\)Beyond demographics: fine\-tuning large language models to predict individuals’ subjective text perceptions\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 2092–2111\.External Links:[Link](https://aclanthology.org/2025.acl-long.104/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.104),ISBN 979\-8\-89176\-251\-0Cited by:[§2](https://arxiv.org/html/2605.06318#S2.p2.1)\.
- M\. Orlikowski, P\. Röttger, P\. Cimiano, and D\. Hovy \(2023\)The ecological fallacy in annotation: modeling human label variation goes beyond sociodemographics\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 1017–1029\.External Links:[Link](https://aclanthology.org/2023.acl-short.88/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-short.88)Cited by:[§1](https://arxiv.org/html/2605.06318#S1.p1.1),[§1](https://arxiv.org/html/2605.06318#S1.p4.1),[Table 1](https://arxiv.org/html/2605.06318#S2.T1.1.1.4.2.1),[§2](https://arxiv.org/html/2605.06318#S2.p2.1)\.
- J\. Pei and D\. Jurgens \(2023\)When do annotator demographics matter? measuring the influence of annotator demographics with the POPQUORN dataset\.InProceedings of the 17th Linguistic Annotation Workshop \(LAW\-XVII\),J\. Prange and A\. Friedrich \(Eds\.\),Toronto, Canada,pp\. 252–265\.External Links:[Link](https://aclanthology.org/2023.law-1.25/),[Document](https://dx.doi.org/10.18653/v1/2023.law-1.25)Cited by:[Appendix K](https://arxiv.org/html/2605.06318#A11.p1.1),[Table 1](https://arxiv.org/html/2605.06318#S2.T1.1.1.9.7.1),[§2](https://arxiv.org/html/2605.06318#S2.p2.1),[§3](https://arxiv.org/html/2605.06318#S3.p4.1),[footnote 5](https://arxiv.org/html/2605.06318#footnote5)\.
- J\. Piironen and A\. Vehtari \(2017\)Sparsity information and regularization in the horseshoe and other shrinkage priors\.Electronic Journal of Statistics11\(2\),pp\. 5018–5051\.External Links:[Document](https://dx.doi.org/10.1214/17-EJS1337SI)Cited by:[§4](https://arxiv.org/html/2605.06318#S4.p8.1)\.
- M\. Rezvan, S\. Shekarpour, L\. Balasuriya, K\. Thirunarayan, V\. L\. Shalin, and A\. Sheth \(2018\)A quality type\-aware annotated corpus and lexicon for harassment research\.InProceedings of the 10th ACM Conference on Web Science,WebSci ’18,New York, NY, USA,pp\. 33–36\.External Links:ISBN 9781450355636,[Link](https://doi.org/10.1145/3201064.3201103),[Document](https://dx.doi.org/10.1145/3201064.3201103)Cited by:[§4](https://arxiv.org/html/2605.06318#S4.p3.1)\.
- B\. J\. Richards and D\. D\. Malvern \(1997\)Quantifying lexical diversity in the study of language development\.University of Reading, Faculty of Education and Community Studies\.Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px5.p1.3)\.
- G\. Rizzi, P\. Rosso, and E\. Fersini \(2025\)Is a bunch of words enough to detect disagreement in hateful content?\.InProceedings of Context and Meaning: Navigating Disagreements in NLP Annotation,M\. Roth and D\. Schlechtweg \(Eds\.\),Abu Dhabi, UAE,pp\. 1–11\.External Links:[Link](https://aclanthology.org/2025.comedi-1.1/)Cited by:[Table 1](https://arxiv.org/html/2605.06318#S2.T1.1.1.5.3.1),[§2](https://arxiv.org/html/2605.06318#S2.p3.1),[§5\.3](https://arxiv.org/html/2605.06318#S5.SS3.p1.1)\.
- P\. Sachdeva, R\. Barreto, G\. Bacon, A\. Sahn, C\. von Vacano, and C\. Kennedy \(2022\)The measuring hate speech corpus: leveraging rasch measurement theory for data perspectivism\.InProceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022,G\. Abercrombie, V\. Basile, S\. Tonelli, V\. Rieser, and A\. Uma \(Eds\.\),Marseille, France,pp\. 83–94\.External Links:[Link](https://aclanthology.org/2022.nlperspectives-1.11/)Cited by:[§3](https://arxiv.org/html/2605.06318#S3.p3.1)\.
- M\. Sap, S\. Swayamdipta, L\. Vianna, X\. Zhou, Y\. Choi, and N\. A\. Smith \(2022\)Annotators with attitudes: how annotator beliefs and identities bias toxic language detection\.InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,M\. Carpuat, M\. de Marneffe, and I\. V\. Meza Ruiz \(Eds\.\),Seattle, United States,pp\. 5884–5906\.External Links:[Link](https://aclanthology.org/2022.naacl-main.431/),[Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.431)Cited by:[§1](https://arxiv.org/html/2605.06318#S1.p2.1),[§2](https://arxiv.org/html/2605.06318#S2.p2.1),[§2](https://arxiv.org/html/2605.06318#S2.p3.1)\.
- H\. S\. Sichel \(1975\)On a distribution law for word frequencies\.Journal of the American Statistical Association70\(351a\),pp\. 542–547\.External Links:[Document](https://dx.doi.org/10.1080/01621459.1975.10482469),[Link](https://doi.org/10.1080/01621459.1975.10482469)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px5.p1.3)\.
- E\. H\. Simpson \(1949\)Measurement of Diversity\.Nature163\.External Links:[Link](https://doi.org/10.1038/163688a0)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px5.p1.3)\.
- H\. Sun, J\. Pei, M\. Choi, and D\. Jurgens \(2025\)Sociodemographic prompting is not yet an effective approach for simulating subjective judgments with LLMs\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 845–854\.External Links:[Link](https://aclanthology.org/2025.naacl-short.71/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-short.71),ISBN 979\-8\-89176\-190\-2Cited by:[§1](https://arxiv.org/html/2605.06318#S1.p1.1),[§2](https://arxiv.org/html/2605.06318#S2.p2.1)\.
- N\. Tahaei and S\. Bergler \(2024\)Analysis of annotator demographics in sexism detection\.InProceedings of the 5th Workshop on Gender Bias in Natural Language Processing \(GeBNLP\),A\. Faleńska, C\. Basta, M\. Costa\-jussà, S\. Goldfarb\-Tarrant, and D\. Nozza \(Eds\.\),Bangkok, Thailand,pp\. 376–383\.External Links:[Link](https://aclanthology.org/2024.gebnlp-1.24/),[Document](https://dx.doi.org/10.18653/v1/2024.gebnlp-1.24)Cited by:[§1](https://arxiv.org/html/2605.06318#S1.p1.1),[§2](https://arxiv.org/html/2605.06318#S2.p2.1)\.
- M\. C\. Templin \(1957\)Certain language skills in children: their development and interrelationships\.New edition, Vol\.26,University of Minnesota Press\.External Links:ISBN 9780816671045,[Link](http://www.jstor.org/stable/10.5749/j.ctttv2st)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px5.p1.3)\.
- A\. N\. Uma, T\. Fornaciari, D\. Hovy, S\. Paun, B\. Plank, and M\. Poesio \(2021\)Learning from Disagreement: A Survey\.Journal of Artificial Intelligence Research72,pp\. 1385–1470\.External Links:ISSN 1076\-9757,[Document](https://dx.doi.org/10.1613/jair.1.12752),[Link](https://www.jair.org/index.php/jair/article/view/12752)Cited by:[§2](https://arxiv.org/html/2605.06318#S2.p1.1),[§2](https://arxiv.org/html/2605.06318#S2.p3.1)\.
- S\. van Erp, D\. L\. Oberski, and J\. Mulder \(2019\)Shrinkage priors for Bayesian penalized regression\.Journal of Mathematical Psychology89,pp\. 31–50\.External Links:[Document](https://dx.doi.org/10.1016/j.jmp.2018.12.004)Cited by:[§4](https://arxiv.org/html/2605.06318#S4.p8.1)\.
- R\. Wan, J\. Kim, and D\. Kang \(2023\)Everyone’s voice matters: quantifying annotation disagreement using demographic information\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.37:12,pp\. 14523–14530\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/26698)Cited by:[§1](https://arxiv.org/html/2605.06318#S1.p1.1),[Table 1](https://arxiv.org/html/2605.06318#S2.T1.1.1.3.1.1),[§2](https://arxiv.org/html/2605.06318#S2.p2.1)\.
- T\. C\. Weerasooriya, A\. Ororbia, R\. Bhensadadia, A\. KhudaBukhsh, and C\. Homan \(2023\)Disagreement matters: preserving label diversity by jointly modeling item and annotator label distributions with DisCo\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 4679–4695\.External Links:[Link](https://aclanthology.org/2023.findings-acl.287/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.287)Cited by:[§1](https://arxiv.org/html/2605.06318#S1.p1.1)\.
- B\. Weijters, E\. Cabooter, and N\. Schillewaert \(2010\)The effect of rating scale format on response styles: the number of response categories and response category labels\.International Journal of Research in Marketing27\(3\),pp\. 236–247\.External Links:ISSN 0167\-8116,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ijresmar.2010.02.004),[Link](https://www.sciencedirect.com/science/article/pii/S0167811610000303)Cited by:[Limitations](https://arxiv.org/html/2605.06318#Sx1.p3.1)\.
- M\. Wiegand, J\. Ruppenhofer, A\. Schmidt, and C\. Greenberg \(2018\)Inducing a lexicon of abusive words – a feature\-based approach\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),M\. Walker, H\. Ji, and A\. Stent \(Eds\.\),New Orleans, Louisiana,pp\. 1046–1056\.External Links:[Link](https://aclanthology.org/N18-1095/),[Document](https://dx.doi.org/10.18653/v1/N18-1095)Cited by:[§4](https://arxiv.org/html/2605.06318#S4.p3.1)\.
- B\. Winter, G\. Lupyan, L\. K\. Perry, M\. Dingemanse, and M\. Perlman \(2024\)Iconicity ratings for 14,000\+ English words\.Behavior Research Methods56\(3\),pp\. 1640–1655\.External Links:[Link](https://doi.org/10.3758/s13428-023-02112-6)Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px3.p1.1)\.
- G\. U\. Yule \(1944\)The statistical study of literary vocabulary\.Cambridge University Press\.External Links:ISBN 9781107633711Cited by:[Appendix A](https://arxiv.org/html/2605.06318#A1.SS0.SSS0.Px5.p1.3)\.
- W\. Zhang, H\. Guo, I\. D\. Kivlichan, V\. Prabhakaran, D\. Yadav, and A\. Yadav \(2023\)A Taxonomy of Rater Disagreements: Surveying Challenges & Opportunities from the Perspective of Annotating Online Toxicity\.External Links:2311\.04345,[Link](https://arxiv.org/abs/2311.04345)Cited by:[§2](https://arxiv.org/html/2605.06318#S2.p1.1),[§2](https://arxiv.org/html/2605.06318#S2.p3.1)\.

## Appendix A Full Overview of Linguistic Features

In the following, we describe the linguistic features used in this work\. For a full overview, see Maurer \([2026](https://arxiv.org/html/2605.06318#bib.bib56)\)\.

##### Surface\-Level Features

We extract the sequence length \(in characters, both with and without whitespace; raw\_sequence\_length and n\_characters\), the number of tokens \(n\_tokens\), sentences \(n\_sentences\), types \(n\_types\), lemmas \(n\_lemmas\), and long words \(over six characters, n\_long\_words\), the number of tokens per sentence \(tokens\_per\_sentence\), the number of characters per sentence \(characters\_per\_sentence\), and the average word length \(avg\_word\_length\)\.

##### Readability Features

We extract the Gunning fog index \(gunning\_fog\), ARI \(ari\), Flesch reading ease \(flesch\_reading\_ease\), the Flesch\-Kincaid grade level \(flesch\_kincaid\_grade, Kincaid et al\., [1975](https://arxiv.org/html/2605.06318#bib.bib26)\), the Coleman\-Liau index \(cli, Coleman and Liau, [1975](https://arxiv.org/html/2605.06318#bib.bib27)\), SMOG \(smog, Mc Laughlin, [1969](https://arxiv.org/html/2605.06318#bib.bib28)\), LIX \(lix, Björnsson, 1968\), and RIX \(rix, Anderson, [1981](https://arxiv.org/html/2605.06318#bib.bib30)\), as well as the number of syllables in an item \(n\_syllables\), the number of words with only one syllable \(n\_monosyllables\), and the number of words with more than two syllables \(n\_polysyllables\)\.
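As a rough illustration of how some of these scores are derived from simple counts, the sketch below implements the standard textbook formulas for Flesch reading ease, Flesch\-Kincaid grade level, LIX, and RIX. The function and parameter names are hypothetical, and the feature extraction package used in this work may differ in details such as syllable counting or tokenization.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch reading ease: higher values indicate easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level (approximate US school grade)."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def lix(n_words, n_sentences, n_long_words):
    """LIX: average sentence length plus percentage of long words (> 6 characters)."""
    return n_words / n_sentences + 100.0 * n_long_words / n_words


def rix(n_sentences, n_long_words):
    """RIX: number of long words (> 6 characters) per sentence."""
    return n_long_words / n_sentences
```

All four functions expect at least one sentence and one word; very short items, which are common in social media data, can produce extreme values.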

##### Psycholinguistic Norm Features

We extract the average rating across the tokens of an item \(avg\_\{norm\}\), the average standard deviation in the human ratings across the tokens of an item \(avg\_std\_\{norm\}\), the number of tokens in an item with a high rating \(upper third of the ordinal scale, n\_high\_\{norm\}\), the number of tokens in an item with a low rating \(lower third of the ordinal scale, n\_low\_\{norm\}\), and the number of tokens with a particularly high standard deviation \(spanning multiple thirds of the scale, n\_high\_std\_\{norm\}\)\. We use concreteness norms \(Brysbaert et al\., [2014](https://arxiv.org/html/2605.06318#bib.bib5)\), marked by concreteness \(e\.g\., in avg\_concreteness\), word prevalence \(Brysbaert et al\., [2019](https://arxiv.org/html/2605.06318#bib.bib6)\), marked by prevalence, Age\-of\-Acquisition norms \(Kuperman et al\., [2012](https://arxiv.org/html/2605.06318#bib.bib7)\), marked by aoa, Socialness norms \(Diveica et al\., [2023](https://arxiv.org/html/2605.06318#bib.bib8)\), marked by socialness, Iconicity norms \(Winter et al\., [2024](https://arxiv.org/html/2605.06318#bib.bib9)\), marked by iconicity, and Sensorimotor norms \(Lynott et al\., [2020](https://arxiv.org/html/2605.06318#bib.bib10)\), per perceptual modality \(e\.g\., visual\) and action effector \(e\.g\., arm/hand\), marked by \{modality\|effector\} \(e\.g\., avg\_arm\)\.

##### Part\-of\-Speech Features\.

We extract the number of tokens per POS tag \(e\.g\., the number of nouns, n\_noun\), the number of lexical tokens \(nouns, verbs, adjectives, and adverbs; n\_lexical\_tokens\), and the POS variability \(number of different POS tags relative to the number of tokens, pos\_variability\)\.

##### Lexical Richness Measures

We extract the type\-token ratio \(ttr, Templin, [1957](https://arxiv.org/html/2605.06318#bib.bib11)\), root TTR \(rttr, Guiraud, 1954\), corrected TTR \(cttr, Carroll, [1964](https://arxiv.org/html/2605.06318#bib.bib17)\), Herdan’s C \(herdan\_c, Herdan, [1964](https://arxiv.org/html/2605.06318#bib.bib18)\), Summer’s TTR \(summer\_index\), Dugast’s Uber index \(dugast\_u, Dugast, [1978](https://arxiv.org/html/2605.06318#bib.bib12)\), Maas’ TTR \(maas\_index, Maas, [1972](https://arxiv.org/html/2605.06318#bib.bib13)\), Yule’s $K$ \(yule\_k, Yule, [1944](https://arxiv.org/html/2605.06318#bib.bib15)\), Herdan’s $V_m$ \(herdan\_v, Herdan, [1955](https://arxiv.org/html/2605.06318#bib.bib19)\), Simpson’s $D$ \(simpsons\_d, Simpson, [1949](https://arxiv.org/html/2605.06318#bib.bib20)\), mean segmental TTR \(msttr, Richards and Malvern, [1997](https://arxiv.org/html/2605.06318#bib.bib21)\), moving average TTR \(mattr, Covington and McFall, [2010](https://arxiv.org/html/2605.06318#bib.bib22)\), the measure of textual lexical diversity \(mtld, McCarthy and Jarvis, [2010](https://arxiv.org/html/2605.06318#bib.bib24)\), the hypergeometric distribution diversity \(hdd, McCarthy and Jarvis, [2007](https://arxiv.org/html/2605.06318#bib.bib23), [2010](https://arxiv.org/html/2605.06318#bib.bib24)\), the local and global numbers of hapax \(dis\)legomena \(n\_hapax\_legomena, n\_global\_token\_hapax\_legomena\), Sichel’s S \(sichel\_s, Sichel, [1975](https://arxiv.org/html/2605.06318#bib.bib14)\), and the lexical density \(lexical\_density\)\.
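To make the simplest of these measures concrete, the following sketch computes the type\-token ratio and three length\-corrected variants from a token list, using their standard definitions with $V$ types and $N$ tokens. The function name is hypothetical, and the extraction package used in this work may tokenize or lowercase differently.

```python
import math


def lexical_richness(tokens):
    """Return TTR and three length-corrected variants for a list of tokens."""
    n = len(tokens)       # N: number of tokens
    v = len(set(tokens))  # V: number of types
    if n < 2:
        raise ValueError("need at least two tokens (Herdan's C divides by log N)")
    return {
        "ttr": v / n,                            # type-token ratio: V / N
        "rttr": v / math.sqrt(n),                # root TTR: V / sqrt(N)
        "cttr": v / math.sqrt(2 * n),            # corrected TTR: V / sqrt(2N)
        "herdan_c": math.log(v) / math.log(n),   # Herdan's C: log V / log N
    }
```

For example, `lexical_richness("the cat sat on the mat".split())` sees six tokens and five types, so the plain TTR is 5/6.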

##### Morphological Features\.

We extract the number of tokens with a given morphological feature for all available Universal Dependencies morpho\-syntactic features \(de Marneffe et al\., [2021](https://arxiv.org/html/2605.06318#bib.bib72)\), marked in the format n\_\{pos\}\_\{attribute\}\_\{feature\} \(e\.g\., the number of singular nouns, n\_NOUN\_Number\_Sing\)\.

##### Information\-Theoretic Features\.

We extract the compressibility \(compressibility\) and Shannon entropy per item \(entropy\)\.
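The sketch below shows one plausible way to compute these two quantities. It is an illustration under assumptions, not the definition used in this work: entropy is taken over the token distribution of an item, and compressibility is taken as the ratio of zlib\-compressed size to raw size in bytes. Both function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution within one item."""
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def compressibility(text):
    """Assumed definition: compressed size divided by raw size (lower = more redundant)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("empty text")
    return len(zlib.compress(raw)) / len(raw)
```

Under this assumed definition, very short items can have a ratio above 1 because of the fixed overhead of the compression format.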

##### Dependency Features

We extract the number of dependency relations per relation type \(according to Universal Dependencies, de Marneffe et al\., [2021](https://arxiv.org/html/2605.06318#bib.bib72)\), marked in the format n\_dependency\_\{type\} \(e\.g\., n\_dependency\_nsubj\), the number of noun chunks in the text \(noun\_chunks\), the tree width \(tree\_width\), the tree depth \(tree\_depth\), the tree branching factor \(branching\_factor\), and the ramification factor \(ramification\_factor\)\.

##### Semantic Features\.

We extract the average size of the synsets \(avg\_synsets\), the number of tokens with a large synset \(more than four senses, n\_high\_synsets\), and the number of tokens with a small synset \(fewer than three senses, n\_low\_synsets\) for nouns, adjectives, and verbs, respectively, and overall\. We also extract the number of hedges \(n\_hedges\), i\.e\., expressions that indicate a speaker’s uncertainty, such as "probably", "maybe", or "i think"\.

##### Named Entity Features\.

We extract the number of named entities overall \(n\_entities\) and per entity type \(e\.g\., n\_fac, i\.e\., facilities such as buildings and airports\)\.

##### Emotion and Sentiment Features\.

We use the NRC\-VAD lexicon \(Mohammad, [2018a](https://arxiv.org/html/2605.06318#bib.bib73)\) for valence, arousal, and dominance, the NRC emotion intensity lexicon \(Mohammad, [2018b](https://arxiv.org/html/2605.06318#bib.bib74)\) for the emotion intensity per basic emotion \(anger, anticipation, disgust, fear, joy, sadness, surprise, trust\), and the NRC word\-emotion association lexicon \(Mohammad and Turney, [2010](https://arxiv.org/html/2605.06318#bib.bib75), [2013](https://arxiv.org/html/2605.06318#bib.bib31)\) for sentiment\. Per emotion dimension, we extract the average rating per item \(avg\_\{emotion\}, avg\_\{valence\|arousal\|dominance\}\), the number of tokens with a high rating \(n\_high\_\{emotion\}, n\_high\_\{valence\|arousal\|dominance\}\), and the number of tokens with a low rating \(n\_low\_\{emotion\}, n\_low\_\{valence\|arousal\|dominance\}\)\. For sentiment, per item, we extract the number of tokens with positive and negative sentiment \(n\_\{positive\|negative\}\_sentiment\) and the difference between them, normalized by the total number of tokens in the item \(sentiment\_score\)\.
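The sentiment score described above can be sketched as follows. The two word sets stand in for the lexicon and are purely hypothetical, and the sign convention \(positive minus negative\) is an assumption, since the text only states that a difference is taken.

```python
def sentiment_score(tokens, positive_words, negative_words):
    """Assumed definition: (n_positive - n_negative) / n_tokens for one item."""
    n_positive = sum(token in positive_words for token in tokens)
    n_negative = sum(token in negative_words for token in tokens)
    return (n_positive - n_negative) / len(tokens)


# Toy lexicon entries, for illustration only (not taken from the NRC lexicon).
toy_positive = {"love"}
toy_negative = {"awful"}
score = sentiment_score(
    ["i", "love", "this", "awful", "awful", "day"], toy_positive, toy_negative
)
```

In this toy item, one positive and two negative tokens out of six give a score of -1/6.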

## Appendix B Domain\-specific Lexicons

| Source | Feature | Explanation |
| --- | --- | --- |
| Hatebase | n\_hatebase | Number of tokens found on Hatebase |
| Abusive Words | n\_abusive | Number of tokens found in the Abusive Words lexicon |
| Hurtlex | n\_ps | Number of negative stereotype/ethnic slur tokens |
| Hurtlex | n\_rci | Number of location/demonym tokens |
| Hurtlex | n\_pa | Number of profession/occupation tokens |
| Hurtlex | n\_ddf | Number of tokens related to physical disabilities and diversity |
| Hurtlex | n\_ddp | Number of tokens related to cognitive disabilities and diversity |
| Hurtlex | n\_dmc | Number of tokens related to moral and behavioral defects |
| Hurtlex | n\_rci | Number of tokens related to physical disabilities and diversity |
| Hurtlex | n\_is | Number of tokens related to social and economic disadvantage |
| Hurtlex | n\_or | Number of tokens related to plants |
| Hurtlex | n\_an | Number of tokens related to animals |
| Hurtlex | n\_asm | Number of tokens related to male genitalia |
| Hurtlex | n\_asf | Number of tokens related to female genitalia |
| Hurtlex | n\_ps | Number of tokens related to prostitution |
| Hurtlex | n\_om | Number of tokens related to homosexuality |
| Hurtlex | n\_qas | Number of tokens with potentially negative connotations |
| Hurtlex | n\_cds | Number of derogatory tokens |
| Hurtlex | n\_re | Number of tokens related to felonies, crime, and immoral behavior |
| Hurtlex | n\_svp | Number of tokens related to the seven deadly sins of the Christian tradition |
| Harassment Lexicon | n\_generic | Number of tokens related to harassment |
| Harassment Lexicon | n\_sexual | Number of tokens related to sexual harassment |
| Harassment Lexicon | n\_appearance | Number of tokens related to appearance\-related harassment |
| Harassment Lexicon | n\_racial | Number of tokens related to racial harassment |
| Harassment Lexicon | n\_intelligence | Number of tokens related to intellectual harassment |
| Harassment Lexicon | n\_politics | Number of tokens related to political harassment |
|  | n\_hateful | Number of tokens in the union of all lexicons |

Table 3: Domain\-specific lexicon\-based features with explanations for them\.

Table [3](https://arxiv.org/html/2605.06318#A2.T3) lists the features per source, including a short explanation\.

## Appendix C Preprocessing: Full Details

In the following, we describe the preprocessing of the linguistic features and annotator characteristics used in this work’s analyses\.

##### Linguistic Features

We preprocess the linguistic features in two steps: \(1\) If the feature is an occurrence\-count feature \(e\.g\., number of hedges\), we normalize by the number of tokens in the respective item \(i\.e\., we divide the respective occurrence count by the number of tokens in the item\)\. This allows us to analyze relative trends rather than raw frequencies\. For instance, longer texts can generally be expected to have more hedges, but the interesting items are those with a high number of hedges relative to their length\. \(2\) We standardize all linguistic features to have a mean of 0 and a standard deviation of 1\.
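The two preprocessing steps can be sketched in a few lines. The function names are hypothetical, and the use of the population standard deviation \(rather than the sample standard deviation\) is an assumption.

```python
from statistics import mean, pstdev


def normalize_counts(counts, n_tokens):
    """Step 1: divide each item's occurrence count by that item's token count."""
    return [count / n for count, n in zip(counts, n_tokens)]


def standardize(values):
    """Step 2: z-standardize a feature to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(value - mu) / sigma for value in values]


# Three items with 1, 2, and 4 hedges and 10, 10, and 40 tokens, respectively.
relative_hedges = normalize_counts([1, 2, 4], [10, 10, 40])
z_hedges = standardize(relative_hedges)
```

Note how the third item has the most hedges in absolute terms but, after normalization, the same relative frequency as the first.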

##### Annotator Characteristics

Before pre\-processing annotator characteristics, we remove missing answers and "prefer not to answer" \(or similar\) responses\. We remove all annotators with conflicting characteristics for the same annotator ID\. This ensures the expected data structure, i\.e\., each annotator ID corresponds to exactly one annotator\. While this potentially removes cases of reasonable changes over the span of the respective annotation period \(e\.g\., an annotator turning 35, moving into another age group\), this ensures removing annotators that are almost certainly spammers \(e\.g\., annotators changing all the characteristics\)\.
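A minimal sketch of the consistency filter, assuming the annotator data is available as one record per annotation with the annotator's characteristics attached. The record layout and the key name `annotator_id` are hypothetical.

```python
def drop_conflicting_annotators(rows, id_key="annotator_id"):
    """Remove all rows of annotators whose characteristics differ across rows.

    rows: list of dicts, each holding an annotator ID plus characteristics.
    """
    first_profile = {}
    conflicting = set()
    for row in rows:
        annotator = row[id_key]
        # Characteristics as a sorted tuple so that profiles can be compared.
        profile = tuple(sorted((k, v) for k, v in row.items() if k != id_key))
        if annotator in first_profile and first_profile[annotator] != profile:
            conflicting.add(annotator)
        first_profile.setdefault(annotator, profile)
    return [row for row in rows if row[id_key] not in conflicting]


rows = [
    {"annotator_id": "a1", "age": "25-29", "gender": "female"},
    {"annotator_id": "a1", "age": "25-29", "gender": "female"},
    {"annotator_id": "a2", "age": "30-34", "gender": "male"},
    {"annotator_id": "a2", "age": "35-39", "gender": "male"},
]
kept = drop_conflicting_annotators(rows)
```

Here annotator `a2` reports two different age groups and is therefore removed entirely, which mirrors the trade\-off described above.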

We harmonize the socio\-demographic variables gender \(by mapping to male, female, and diverse\) and education \(by mapping the respective datasets’ scheme to the International Standard Classification of Education, ISCED 2011; OECD et al\., [2015](https://arxiv.org/html/2605.06318#bib.bib60)\)\. We harmonize MHS with POPQUORN by coding raw age in years into the respective age ranges \(e\.g\., $26 \to 25\text{-}29$\)\.

CTDP and MHS allow for arbitrary race answers, and CTDP for arbitrary sexuality and religion answers, leading to self\-described identities like Buddhist, Christian and Atheist for the same person\. While this points to people wanting to express their more complex identities, it leads to an explosion of the number of categories, most of which are rather sparse\. For race, we thus only keep categories that involve exactly one race \(e\.g\., White or Asian\) and the five most frequent categories involving two\. The more complex categories are mapped to a catch\-all multiracial category\. For religion and sexuality, we only keep annotators belonging to categories involving exactly one religion and sexuality, respectively\. In consequence, we drop $\sim 2\%$ of annotators for MHS who assign multi\-religious categories or multiple sexualities to themselves\. \(Footnote 15: We observe co\-occurrences especially for bisexual and straight or gay, potentially indicating that these annotators would have wished for more fine\-grained preference options\.\)
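The collapsing of race categories can be sketched as below. This is an illustration under assumptions: each annotator's answer is represented as a collection of self\-described races, and every answer that is neither a single race nor one of the most frequent two\-race combinations is mapped to the catch\-all category. The function name and the label format for two\-race categories are hypothetical.

```python
from collections import Counter


def collapse_race(answers, top_k_pairs=5):
    """Map multi-valued race answers to a manageable set of categories."""
    normalized = [tuple(sorted(answer)) for answer in answers]
    pair_counts = Counter(a for a in normalized if len(a) == 2)
    kept_pairs = {pair for pair, _ in pair_counts.most_common(top_k_pairs)}
    collapsed = []
    for answer in normalized:
        if len(answer) == 1:
            collapsed.append(answer[0])             # exactly one race: keep as-is
        elif answer in kept_pairs:
            collapsed.append(" and ".join(answer))  # frequent two-race category
        else:
            collapsed.append("multiracial")         # everything else: catch-all
    return collapsed


answers = [
    ("White",),
    ("Asian", "White"),
    ("White", "Asian"),
    ("Black", "White"),
    ("Asian", "Black", "White"),
]
categories = collapse_race(answers, top_k_pairs=1)
```

With `top_k_pairs=1`, only the most frequent pair survives as its own category in this toy example; the analysis described above keeps five.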

In D3CODE, we drop the cultural region variable and only keep the more fine\-grained geo\-cultural indicator country\.

We keep all other annotator characteristics and code them accordingly\. Table [4](https://arxiv.org/html/2605.06318#A4.T4) in Appendix [D](https://arxiv.org/html/2605.06318#A4) provides an overview of all annotator characteristics used in the following analyses with their variable type \(e\.g\., nominal or ordinal\) and the chosen reference level for nominal variables\.

## Appendix D Details Annotator Characteristics

Table 4: Annotator characteristics per dataset with data type and reference level for nominal variables\.

Table [4](https://arxiv.org/html/2605.06318#A4.T4) shows annotator characteristics per dataset with data type and reference level for nominal variables\.

## Appendix E Full Results for CTDP

Figure [7](https://arxiv.org/html/2605.06318#A5.F7) shows all surviving effects from the analysis in Section [7](https://arxiv.org/html/2605.06318#S7)\.

![Refer to caption](https://arxiv.org/html/2605.06318v1/x7.png)

Figure 7: Posterior estimates for surviving effects of CTDP\.

## Appendix F Text Examples

Figure [8](https://arxiv.org/html/2605.06318#A6.F8) contains examples for items from POPQUORN containing a relatively high number of words related to moral/behavioral deficiencies \(n\_dmc\), and Figure [9](https://arxiv.org/html/2605.06318#A6.F9) contains examples for items from POPQUORN containing a relatively high number of named entities of type nationalities, religious or political groups \(n\_norp\)\. Figure [10](https://arxiv.org/html/2605.06318#A6.F10) contains examples for items from MHS containing a relatively high number of tokens with a high olfactory grounding \(n\_high\_Olfactory\), and Figure [11](https://arxiv.org/html/2605.06318#A6.F11) contains examples for items from MHS containing a relatively high number of tokens with a high auditory grounding \(n\_high\_Auditory\)\.

Ok\. Thanks for your opinion\.\. I don’t know if the story is true or not\.\. if true\.\. they should be prosecuted\! Don’t enter my home illegally\! No excuse for evil on part of authorities\. … I’m not naive to believe LAR’s are ALL angelic beings\. 78\.2% of Washington Post info is biased, unverified and misleading\. There’s your debate\.

\(a\)

I don’t think you can argue this very well, because someone who has less morals or ethics than another person is not excluded from acting morally or ethicallt, but someone who does have morals and ethics would not act immoral or unethical\. For this reason the person who has less morals or ethics has a larger set of options when making money compared to the more moral/ethical, so they will always make equal or better than the moral person\. I think a more difficult question is whether acting immorally makes you more money than acting morally\.

\(b\)

Well, they are likely filthy rich…

\(c\)

Figure 8: Example items from POPQUORN containing a relatively high number of words related to moral/behavioral deficiencies \(n\_dmc, colored in cyan\)\.

Don’t worry\. Israel has already told the UN there will be no investigation\. Gotta love that jew privilege\.

\(a\)

Congrats on the 1:30 Israeli/Palestinian casualty ratio\. Hamas must be patting themselves on their backs and looking for a repeat of that success\.

\(b\)

A less exhaustive background check would risk accepting Muslims on the flight, which would completely defeat the whole point of a Muslim\-free flight\.

\(c\)

I guess I can say the same about Muslims? Just waiting for y’all to learn that they do in Muslim countries, it makes the Catholic Church look like Disneyland

\(d\)

Figure 9: Example items from POPQUORN containing a relatively high number of named entities of type nationalities, religious or political groups \(n\_norp, colored in cyan\)\.

Hahaha I love trashbag nazi Americans\. Have fun getting shot you inbred piece of shit

\(a\)

Some real nasty women over here, the nastiest\. Sad\!

\(b\)

talk shit to my face so i can rock your shit pussy

\(c\)

Figure 10: Example items from MHS containing a relatively high number of tokens with a high olfactory grounding \(n\_high\_Olfactory, colored in cyan\)\.

It’s easy, randomly click until you lose\.

\(a\)

We all know she’ll try to \*talk to the manager\* while down in hell\.

\(b\)

Figure 11: Example items from MHS containing a relatively high number of tokens with a high auditory grounding \(n\_high\_Auditory, colored in cyan\)\.

## Appendix G Linguistic Feature Preselection: Example

To showcase our selection procedure as described in Section [4](https://arxiv.org/html/2605.06318#S4), we walk through the full procedure for POPQUORN in the following\.

We start with 327 features, which we first filter by pairwise Pearson correlation, arriving at 106 features, each of which has a correlation lower than $0.5$ with every other feature\. The remaining 225 features are clustered to select features of interest in the correlation clusters\.
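The filter\-then\-cluster idea can be sketched as follows. This is a simplified stand\-in rather than the procedure used in this work: it assumes the threshold is applied to the absolute Pearson correlation, and it groups the correlated features by connected components of the correlation graph instead of the clustering method described in Section 4\. All function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def preselect(features, threshold=0.5):
    """Split features into weakly correlated ones and clusters of correlated ones.

    features: dict mapping feature name -> list of values (one per item).
    """
    names = list(features)
    neighbors = {name: set() for name in names}
    for a, b in combinations(names, 2):
        if abs(pearson(features[a], features[b])) >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    # Features below the threshold with every other feature are kept directly.
    independent = [name for name in names if not neighbors[name]]
    # The rest are grouped; one representative per group is then picked by hand.
    clusters, seen = [], set()
    for name in names:
        if neighbors[name] and name not in seen:
            stack, component = [name], []
            seen.add(name)
            while stack:
                current = stack.pop()
                component.append(current)
                for other in neighbors[current] - seen:
                    seen.add(other)
                    stack.append(other)
            clusters.append(sorted(component))
    return independent, clusters


toy = {
    "n_tokens": [1, 2, 3, 4, 5],
    "n_characters": [2, 4, 6, 8, 11],
    "n_hedges": [2, 0, 1, 0, 2],
}
independent, clusters = preselect(toy)
```

In the toy data, the two length features end up in one cluster while the hedge count is kept directly; the manual choice of a representative per cluster is illustrated with the real clusters below.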

\[’tokens\_per\_sentence’, ’tree\_width’, ’tree\_depth’, ’tree\_branching’, ’entropy’, ’mtld’, ’yule\_k’, ’herdan\_v’, ’n\_polysyllables’, ’flesch\_reading\_ease’, ’flesch\_kincaid\_grade’, ’gunning\_fog’, ’ari’, ’smog’, ’lix’, ’rix’\]

Figure 12: Example: Cluster 6 in the linguistic feature preselection procedure for POPQUORN\.

For instance, one of the clusters is shown in Figure [12](https://arxiv.org/html/2605.06318#A7.F12)\. The features in this cluster are all related to readability and syntactic complexity\. In this case, we pick the Flesch reading ease, as it is a widely used readability measure\.

\[’avg\_word\_length’, ’n\_long\_words’, ’n\_NOUN\_Number\_Sing’, ’n\_PRON\_PronType\_Prs’, ’n\_PRON\_Number\_Sing’, ’n\_PRON\_Case\_Nom’, ’n\_PRON\_Person\_1’, ’n\_ADJ\_Degree\_Pos’, ’n\_PROPN\_Number\_Sing’, ’n\_dependency\_amod’, ’n\_dependency\_compound’, ’n\_dependency\_nsubj’, ’n\_lexical\_tokens’, ’lexical\_density’, ’n\_adj’, ’n\_noun’, ’n\_pron’, ’n\_propn’, ’avg\_aoa’, ’n\_high\_aoa’, ’avg\_sd\_aoa’, ’cli’, ’avg\_n\_synsets’, ’avg\_n\_synsets\_verb’, ’n\_low\_synsets’, ’n\_high\_synsets’, ’n\_entities’, ’n\_person’, ’n\_entities\_token\_ratio’, ’n\_entities\_sentence\_ratio’, ’n\_global\_token\_hapax\_legomena’, ’n\_global\_token\_hapax\_dislegomena’, ’global\_sichel\_s’, ’n\_global\_lemma\_hapax\_legomena’, ’n\_global\_lemma\_hapax\_dislegomena’\]

Figure 13:Example: Cluster 2 in the linguistic feature preselection procedure forPOPQUORN\.Naturally, as theoretical equivalence is only one reason for high correlations, not all of the clusters will contain clearly\-cut, theoretically equivalent features\. For instance, in the cluster shown in Figure[13](https://arxiv.org/html/2605.06318#A7.F13)\. We find reasonably related but not necessarily theoretically equivalent features\. For example, the number of nouns \(n\_noun\), the number of named entities \(n\_entities\), the number of adjectives \(n\_adj\), and the number of pronouns \(n\_pron\) clearly have systematic relationships\. In cases like this, we pick a feature allowing for an intuitive interpretation\. In this concrete case, we pick the number of nouns\.

In the end, we arrive at 113 linguistic features that we use in the regression modeling of POPQUORN\.

## Appendix H Term Glossary

The aim of this appendix section is to clarify the meaning of key terms employed in the paper, providing the reader with conceptual coordinates that are necessary for a \(deeper\) understanding of the statistical concepts employed in our narrative, and yet too extensive to find room in the main paper\.

##### \(Partially\) cross\-classified data structure

Datasets regularly follow certain structures\. For example, pupils are nested within classes, which are in turn nested within schools, which are in turn nested within neighborhoods, which are in turn nested within cities\. These nested structures evoke dependencies, which should be taken into account\. That is, it is likely that pupils within one class are more similar to each other than to other pupils\. In other situations, observations can be ascribed to a combination of factors\. More specifically, a rating is made for one specific item by one specific annotator\. Here, annotator ID and item ID are factorially crossed\. When all combinations of annotator ID and item ID exist, we call this a fully cross\-classified data structure\. In contrast, when only part of all combinations of annotator ID and item ID exist, we call this a partially cross\-classified data structure\. In our analyses, we encounter partially cross\-classified data structures because not all annotators rated all items or, vice versa, not all items were rated by all annotators\.
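
The distinction can be checked mechanically. The following minimal sketch assumes that annotations are available as (annotator ID, item ID) pairs; the function name `crossing_structure` and the toy IDs are hypothetical.

```python
def crossing_structure(ratings):
    """Classify a set of (annotator_id, item_id) pairs.

    Returns a label and the share of all possible annotator-item
    combinations that actually occur in the data.
    """
    observed = set(ratings)
    annotators = {annotator for annotator, _ in observed}
    items = {item for _, item in observed}
    possible = len(annotators) * len(items)
    coverage = len(observed) / possible
    if len(observed) == possible:
        return "fully cross-classified", coverage
    return "partially cross-classified", coverage


# Hypothetical toy data: three annotators, three items, five ratings.
toy_ratings = [
    ("a1", "i1"), ("a1", "i2"),
    ("a2", "i2"),
    ("a3", "i1"), ("a3", "i3"),
]
label, coverage = crossing_structure(toy_ratings)
print(label, round(coverage, 2))
```

Here only five of the nine possible combinations occur, so the toy data is partially cross-classified, which is the situation described above.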

##### Frequentist vs\. Bayesian \(regression\)

Regression analysis is a statistical method for investigating relationships among variables\. It can be used for purposes of prediction and/or explanation\. In this paper, annotation behavior \(i\.e\., ratings of hatefulness on a scale from 1 to 5\) is predicted on the basis of several linguistic features of the texts and annotator characteristics\.

Frequentist and Bayesian statistics are two distinct philosophical frameworks\. One key difference is the notion of probability\. While frequentists see probability as a long\-run frequency of events \(e\.g\., how often do I get heads if I flip a coin 100,000 times?\), Bayesians see probability as a degree of belief or plausibility \(e\.g\., I believe that there is a 90% chance that it will rain tomorrow\)\. As such, the interpretation of results differs\. The treatment of probability as a degree of belief opens the possibility to incorporate prior information/beliefs into a regression model\. In that sense, Bayesian statistics updates our prior beliefs by incorporating the data that we have into a posterior belief \(i\.e\., what we should believe after seeing the data\)\.

Whether to pick a frequentist or Bayesian analysis is often a matter of philosophical/statistical preference\. However, some situations particularly lend themselves to a Bayesian framework\. For example, it can be argued that data with few observations profit from more stable estimates in a Bayesian framework because of the incorporation of prior knowledge\. As another example, if the data is high\-dimensional, with more predictors than observations, and the goal is to find the most important predictors, then a Bayesian regression with a horseshoe prior works well\. This is precisely the use case of our analyses\. While some of the datasets we employ have a large number of observations, our large number of predictors \(annotator attributes, linguistic features, interactions\) creates a situation in which the number of observations is limited compared to the number of parameters\. In this scenario, adopting a Bayesian approach allows for a more stable estimation of the surviving effects, thanks to the strong impact of uncertainty in the regularization process\. In practical terms, given the observation/parameter ratio, this approach allows us to be more confident in the effects we observe than we would have been in a frequentist framework\. An additional practical reason is that the Bayesian framework has progressed a lot in the computational optimization for hierarchical models \(compensating for the higher computational cost\)\.

##### Outcome \(predicted, dependent\) vs\. predictor \(independent\)

In regression analysis, we do not only investigate simple relationships among variables\. Instead, we try to predict an outcome variable \(also called dependent variable\) based on one or more predictor variables \(also called independent variables\)\. For example, in this paper we attempt to predict annotation behavior on the basis of several linguistic features of the texts as well as annotator characteristics\.

##### Main effect vs\. interaction effect

In a regression model, there are different types of effects that might be of interest\. Main effects relate to the effect of a variable on the outcome, keeping other independent variables constant at some value\. For instance, what is the effect of annotator age on annotation behavior, keeping all the other independent variables constant at some value? If we employ annotator age or presence of hateful words as main effects in the prediction of annotation behavior, we will learn to what extent an increase in the age of the annotator or in the number of hateful words in an item corresponds to changes in offensiveness ratings \(keeping all the other independent variables constant at some value\)\. Take for example the plot in Figure [2](https://arxiv.org/html/2605.06318#S5.F2), containing the estimates for the main effects \(and interactions\) for the POPQUORN dataset\. Positive estimates indicate that, when a predictor has a higher value \(if continuous, e\.g\., number of hateful words\), the model has identified a tendency for annotations to be higher \(i\.e\., higher degree of offensiveness\)\. Negative estimates indicate the opposite\.

In contrast, interaction effects demonstrate whether the effect of one independent variable on the dependent variable depends on the value of another independent variable, holding all other independent variables constant at 0\. We standardized the predictors, so that the mean value is at 0\. As such, the interpretation of interaction effects is meaningful because keeping all other independent variables constant at 0 translates to keeping all other independent variables constant at the mean\. For example, is the effect of age on annotation behavior different depending on the presence of hateful words in the annotated items? To fully understand interaction effects, it is not enough to inspect Figure [2](https://arxiv.org/html/2605.06318#S5.F2): in this case, the estimate is telling us that the impact of the number of hateful words on offensiveness ratings depends on age sub\-groups of our annotators\. This is precisely the effect displayed in Figure [14](https://arxiv.org/html/2605.06318#A15.F14)\. The plot shows how the difference in annotation behavior between three chosen values for the presence of hateful words varies across the age groups\. Recall that the variables are standardized, so 0 means that an item is in the range of the average number of hateful words across all items in the dataset, 1 that it is one standard deviation above that value, and −1 that it is one standard deviation below\. The interaction shows that as annotator age increases, the effect of the number of hateful words on annotation behavior becomes more pronounced\.
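
A small numeric sketch may help with this reading. It assumes a linear model with two standardized predictors and their product term; the coefficient values are made up for illustration and are not estimates from the paper, and the function names are hypothetical.

```python
def predict(b0, b_hateful, b_age, b_interaction, hateful, age):
    """Linear prediction with one interaction term (standardized predictors)."""
    return b0 + b_hateful * hateful + b_age * age + b_interaction * hateful * age


def conditional_slope(b_hateful, b_interaction, age):
    """Effect of one more SD of hateful words at a given (standardized) age."""
    return b_hateful + b_interaction * age


# Made-up coefficients, purely illustrative.
B0, B_HATEFUL, B_AGE, B_INTERACTION = 2.0, 0.30, 0.05, 0.10

for age in (-1.0, 0.0, 1.0):  # one SD below the mean, mean, one SD above
    slope = conditional_slope(B_HATEFUL, B_INTERACTION, age)
    print(f"age = {age:+.0f} SD: slope of hateful words = {slope:.2f}")
```

With a positive interaction coefficient, the slope of the number of hateful words grows with annotator age, which is the pattern described for Figure 14. At age 0 (the mean), the slope reduces to the main effect.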

##### Fixed vs\. random effect

Many types of data follow a hierarchical structure\. For example, in our scenario, annotations belong to one specific combination of annotator ID and item ID \(see glossary ‘\(Partially\) cross\-classified data structure’\)\. This creates dependencies/correlations within clusters/groups that should be taken into account in most cases when creating statistical models\. More specifically, we would expect that annotators have different baseline levels of annotating items; one annotator might be very sensitive to hateful items while another annotator might not be\. The same applies to items\. These natural tendencies can be modeled in a regression by introducing random intercepts \(i\.e\., one type of random effect\)\. The resulting parameter estimate shows the natural tendency of annotators or items to vary at baseline\.

In practical terms, this means that when estimating the fixed effects for our predictors, we assume that each individual annotator has a different starting point when annotating hatefulness/offensiveness: this is the baseline from which they start off, technically in the model, the random intercept\. The stronger the variance between the "baselines" of each annotator in the dataset, the stronger the impact of the corresponding random effect\. Given the subjectivity of the tasks at hand, it is crucial to "factor out" this type of individual variation when making generalizations about the data being annotated\. More generally, we consider it good practice to consider random effects in any task, even the less subjective ones, given that we have a multilevel data structure\. This is true even if the random effects turn out to be weak \(i\.e\., our annotators have a similar baseline in annotation behavior\)\.

The other type of random effect would be random slopes, which would show whether there are different relationships between predictors and the outcome, depending on annotator ID or item ID\. We did not include random slopes in our models\.

In contrast to random effects, the familiar fixed effects \(e\.g\., the relationship between age and annotation behavior\) constitute systematic population\-average effects\.

##### Prior

The use of priors is a distinctive feature of Bayesian statistics as opposed to frequentist statistics\. Priors are probability distributions that, in most cases, are supposed to mirror previous knowledge \(i\.e\., before seeing the data\) about some assertion, such as a parameter \(e\.g\., the probability that a coin lands heads\)\. A simple example is the toss of a coin: Before tossing a specific coin, we already know that the coin is probably fair and brings up heads and tails in equal proportions\. As such, we would place more certainty on 50%/50% compared to 10%/90%\. This knowledge \(and our certainty thereof\) can be incorporated into the prior\. Bayesian modeling can then be conceived as updating our prior knowledge/suspicion by incorporating what the data tells us into a posterior knowledge \(i\.e\., what we should believe after having incorporated what we learned from the data into our prior knowledge\)\. To continue with the coin toss example, assume that prior to tossing the coin you have no reason to doubt its fairness\. Now you toss the coin 100 times and heads is observed 99 times\. Based on this we would be well\-advised to rethink and update our beliefs about the fairness of the coin\. How much we should adjust our beliefs depends on how strong our beliefs were initially\. If our beliefs were quite vague, we should update more towards biasedness; if our beliefs were strong, we should update less\.
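
The coin example can be made concrete with a conjugate Beta-Binomial update. This is a sketch under assumptions: the prior over the probability of heads is taken to be a Beta distribution centered on 0.5, and the two prior strengths (Beta(1, 1) as vague, Beta(500, 500) as strong) are illustrative choices rather than anything used in the paper.

```python
def posterior_mean_heads(prior_heads, prior_tails, heads, tails):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior.

    With a Beta prior and binomial data, the posterior is again a Beta
    distribution whose parameters are the prior parameters plus the counts.
    """
    return (prior_heads + heads) / (prior_heads + prior_tails + heads + tails)


# 100 tosses, 99 heads, as in the example above.
vague = posterior_mean_heads(1, 1, heads=99, tails=1)       # weak belief in fairness
strong = posterior_mean_heads(500, 500, heads=99, tails=1)  # strong belief in fairness

print(f"vague prior:  posterior mean = {vague:.3f}")
print(f"strong prior: posterior mean = {strong:.3f}")
```

Under the vague prior the posterior mean moves close to the observed proportion of heads, while under the strong prior it stays much nearer to 0.5, matching the intuition that stronger initial beliefs are updated less by the same data.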

Even though this toy example provides an intuition about Bayesian statistics, it is worthwhile to consider an example that is more scientifically relevant and where the prior is informed by previous study results\. Imagine that previous studies have shown that there is a positive effect of age on annotator ratings\. This alone gives us information that the prior should be focused on positive effects\. We could further look at the effect estimates and their uncertainties to refine our prior\. That is, we could more precisely set more emphasis on the particular range of effects that was found in previous studies\. Viewed this way, the prior summarizes previous scientific knowledge, and our study and data serve to update this knowledge\.

There are certain situations, however, where priors are strategically used for a certain purpose\. Here, the prior does not reflect prior beliefs anymore\. In our analyses, we use the Horseshoe prior for strong regularization\. That means that weak and uncertain effects get pushed toward 0; only stronger and more certain effects remain mostly untouched\.
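
For readers who want to see where this behavior comes from, the block below sketches the regularized horseshoe of Piironen and Vehtari (2017). That the brms `horseshoe()` prior takes this form is an assumption made here for illustration; the symbols ($\tau$, $\tau_0$, $\lambda_j$, $c$, $\nu$, $s$) are introduced for this sketch and are not notation from the paper.

```latex
\begin{aligned}
\beta_j \mid \lambda_j, \tau, c &\sim \mathcal{N}\!\left(0,\ \tau^2 \tilde{\lambda}_j^2\right),
&\qquad
\tilde{\lambda}_j^2 &= \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2}, \\
\lambda_j &\sim \mathcal{C}^{+}(0, 1),
&\qquad
\tau &\sim \mathcal{C}^{+}(0, \tau_0), \\
c^2 &\sim \text{Inv-Gamma}\!\left(\tfrac{\nu}{2},\ \tfrac{\nu s^2}{2}\right)
\end{aligned}
```

A small global scale $\tau$ pulls all coefficients toward 0, while a large local scale $\lambda_j$ lets an individual coefficient escape that shrinkage. The slab width $c$ bounds how large escaped coefficients can become; as $c \to \infty$, $\tilde{\lambda}_j \to \lambda_j$ and the original horseshoe is recovered.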

## Appendix I Methodological Choices & Alternatives

At given points of our analysis, we decided on one of multiple alternatives\. For full transparency, we discuss reasoning and alternatives in the following subsections\.

### I\.1 Preprocessing

##### Removing Missing/Non\-Answers

Given that one focus of our analyses is to explore whether there are systematic effects of annotator characteristics and interactions of them with item\-level features, we chose to remove missing answers and "Prefer not to answer" responses, since they make modeling more complex, especially for otherwise ordinal predictors\. Furthermore, reasons why people do not want to disclose certain information about themselves may be multifaceted or complex to disentangle\.

It can, however, be argued that these missing or non\-answers provide valuable information\. But dealing with missing data is notoriously difficult because it is rarely clear whether data are missing completely at random, missing at random, or missing not at random\. Working with missing data would involve investigating the source of missingness\.

##### Using Catch\-all Categories

For very rare multi\-answer categories in the annotator characteristics of race and religion, we chose to re\-code them to catch\-all categories \(see Appendix [C](https://arxiv.org/html/2605.06318#A3) for more details\)\. While such categories may reflect complex identities of annotators that will be flattened when re\-coded to catch\-all categories, other approaches, such as treating each answer option as its own predictor and encoding them in a one\-hot manner, would increase the complexity of our models\. For works particularly interested in these specific characteristics, and, for instance, in multiracial identities and interactions of individual racial backgrounds, this may be a viable alternative\.

##### Harmonizing Features

For our first analysis \(Section [5](https://arxiv.org/html/2605.06318#S5)\), we re\-code education to match ISCED 2011 across datasets\. While this is necessary for comparability across datasets that span annotators from and residing in multiple countries, a study focused on a single country may not benefit from or require that step\.

### I\.2 Feature Selection

##### More Theoretic Works

The present work, to a large extent, is exploratory\. The presented \(linguistic\) feature selection method in the main body of this paper aims to remove, in terms of the correlation structure, redundant features while still retaining as many features as possible\. A more theory\-driven analysis, for example, could alleviate the necessity for a partially automated feature selection procedure by carefully pre\-defining which features are of interest for the given question, and eliminating the necessity to include hundreds or thousands of predictors\.

##### Heuristic Selection of Features

The manual selection of representative features after clustering in our feature selection workflow is done mostly for interpretability reasons\. In principle, the inspection could be replaced with an automatic choice per cluster\. This could be either done randomly or heuristically, for example, with a ranking of which features to keep over others if in doubt\. Depending on the correlation structure, this could also be applied to the whole set of features instead of clusters\.

##### Dimensionality Reduction

While approaches using dimensionality reduction may be sensible for extracting latent features that are uncorrelated/orthogonal in an n\-dimensional feature space, one of our goals in selecting linguistic features is to retain interpretability for individual predictors\. Lower\-dimensional latent features may be preferable in works where computational performance and prediction are vital, or notions of distance are of interest, as they provide a more compact representation that is more straightforward to inspect and causes lower complexity in models using them\.

### I\.3 Modeling Decisions

##### Random Slopes

We did not model random slopes because there is no strong theoretical basis for this, and it would have been computationally infeasible\. A more focused analysis with fewer features could explore whether features vary in their relationship to annotation behavior across groups\.

##### Measurement Level of the Annotation Behavior and Linkage Functions

In our analyses, we treat annotation behavior as a continuous variable, as we assume the underlying mental construct to be quasi\-continuous\. Since it is measured on a Likert scale, however, one can argue that it should not be modeled as a continuous variable\. To check robustness, we also model it as an ordinal variable using a cumulative likelihood with a probit link function \(Bürkner and Vuorre, [2019](https://arxiv.org/html/2605.06318#bib.bib51)\) in pilot experiments in Appendix [N](https://arxiv.org/html/2605.06318#A14) for POPQUORN\.

## Appendix J Implementation Details

### J\.1 Preprocessing

Throughout the preprocessing pipeline, we use Python 3\.10\.16 and polars\. Our linguistic feature extraction uses elfen 1\.0\.2\. Our feature filtering procedure uses scipy 1\.14\.1 and numpy 1\.26\.4\.

We use tidyverse 2\.0\.0 in R 4\.3\.3, and polars in Python 3\.10\.16 for preprocessing the annotator characteristics\.

### J\.2 Regression Modeling

We implement the regression models in R 4\.3\.3 using brms 2\.23\.0 with the cmdstanr 0\.9\.0 backend \(cmdstan 2\.37\.0\)\.

We use the default parameters for the horseshoe prior and the default priors in brms \(Bürkner, [2017](https://arxiv.org/html/2605.06318#bib.bib54)\) for all other parameters\.

#### J\.2\.1 Model Formulation

Annotation behavior is modeled as a function of the main effects of the linguistic and annotator features \(`X_L` and `X_S`, respectively\), the interactions among the annotator features \(`X_S:X_S`\), and the interactions between linguistic and annotator features \(`X_L:X_S`\)\. To incorporate the partially cross\-classified data structure, random intercepts for the items and annotators are included \(`(1 | item)` and `(1 | annotator)`, respectively\):

```
y ~ X_L + X_S + X_S:X_S + X_L:X_S +
   (1 | item) + (1 | annotator)
```
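
Written out, the formula above corresponds roughly to the following equation for the rating $y_{ij}$ of item $i$ by annotator $j$. This is a sketch under assumptions: a Gaussian likelihood with identity link, annotator-annotator interactions taken over distinct pairs of annotator features, and index and coefficient names that are introduced here for illustration only.

```latex
\begin{aligned}
y_{ij} ={}& \beta_0
  + \mathbf{x}_{L,i}^{\top}\boldsymbol{\beta}_{L}
  + \mathbf{x}_{S,j}^{\top}\boldsymbol{\beta}_{S}
  + \sum_{k<l} \beta_{SS,kl}\, x_{S,jk}\, x_{S,jl}
  + \sum_{m,k} \beta_{LS,mk}\, x_{L,im}\, x_{S,jk} \\
 & + u_i + v_j + \varepsilon_{ij}, \\
u_i \sim{}& \mathcal{N}\!\left(0, \sigma_{\text{item}}^2\right), \quad
v_j \sim \mathcal{N}\!\left(0, \sigma_{\text{annotator}}^2\right), \quad
\varepsilon_{ij} \sim \mathcal{N}\!\left(0, \sigma^2\right)
\end{aligned}
```

Here $u_i$ and $v_j$ are the random intercepts for items and annotators, and the $\beta$ coefficients are the fixed effects that receive the horseshoe prior.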

#### J\.2\.2 Sampling Details

For each analysis, we use 4 chains to sample from the posterior distribution\. Each chain is initiated with 2,000 warmup iterations, which are then discarded\. After warmup, 7,500 samples are drawn per chain from the posterior, yielding 30,000 samples for consideration\.
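
The sample counts can be checked with a few lines of arithmetic. The sketch assumes that the 7,500 post-warmup draws are per chain, which is the reading consistent with the reported total; the variable names are illustrative.

```python
# Sampling settings as reported in the text above.
chains = 4
warmup_per_chain = 2_000   # discarded after adaptation
draws_per_chain = 7_500    # kept after warmup (assumed to be per chain)

iterations_per_chain = warmup_per_chain + draws_per_chain
total_posterior_draws = chains * draws_per_chain

print(iterations_per_chain, total_posterior_draws)
```

If these settings are passed to brms, where the `iter` argument counts warmup iterations as well, they would presumably correspond to `iter = 9500` and `warmup = 2000`; the exact call is not given in the text, so this mapping is an assumption.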

#### J\.2\.3 Number of Effects per Dataset Models

Table [5](https://arxiv.org/html/2605.06318#A10.T5) shows the total number of effects per dataset model\.

Table 5: Number of effects per dataset models, including main effects and interactions\.

### J\.3 Usage of AI Assistants

In this work, we used GitHub Copilot for inline suggestions and Grammarly for grammatical corrections\.

## Appendix K Reproduction of the Original POPQUORN Analysis

Table [6](https://arxiv.org/html/2605.06318#A11.T6) shows a reproduction of the effects for POPQUORN reported by Pei and Jurgens \([2023](https://arxiv.org/html/2605.06318#bib.bib79)\) with random intercepts only for items \(\(1 \| item\)\) and a comparison to the same model with random intercepts for both items and annotators \(\(1 \| item\) \+ \(1 \| annotator\)\)\. Note that, in contrast to the analyses in the main text of this work, these are frequentist models\. As the results show, given models with the same predictors, all significant effects in a model with only a random intercept for items disappear when a random intercept for annotators is added\.

Table 6: Reproduction of the original POPQUORN analysis with only random intercepts for items \(\(1 \| item\)\) and a comparison with a model that also includes annotator intercepts \(\(1 \| item\) \+ \(1 \| annotator\)\)\. We report the coefficients \(Coef\.\), the standard errors \(Std\.Err\.\), and p\-values \(P\>\|t\|\)\. Significant estimates \(p<0\.05\) are bolded\.

## Appendix L Used Resources

For transparency and as an estimate for similar analyses, this section reports the computational resources that were needed for the analyses reported in the main body of this paper\. We report \(a\) the required memory, and \(b\) compute times per dataset and model\.

### L\.1 Memory Needs

Table [7](https://arxiv.org/html/2605.06318#A12.T7) shows the memory needs of our experiments\. We note that these memory needs are specific to the parameter choices, our hardware, and the choice of the brms backend\. For some observations and comparisons on this, see Appendix [M](https://arxiv.org/html/2605.06318#A13)\.

Table 7: Used memory of each of the models in our analyses\.

### L\.2 Compute Times

Table [8](https://arxiv.org/html/2605.06318#A12.T8) shows the runtimes of the models reported in the main body of this work\.

| Dataset | Split | Time \(in Days\) |
| --- | --- | --- |
| POPQUORN | – | 4\.23 |
| MHS | – | 10\.59 |
| D3CODE | 1 | 11\.12 |
| D3CODE | 2 | 12\.23 |
| CTDP | 1 | 5\.74 |
| CTDP | 2 | 6\.48 |
| CTDP | 3 | 7\.26 |
| Total | | 57\.65 |

Table 8: Compute times of each of the final models used in this work and overall\.

## Appendix M Observations on Limitations of Implementations and Resources

During our analyses, we were confronted with several limitations of implementations and our available resources\. In the following, we discuss the most marked ones\.

The brms backend that is used can make a drastic difference\. Given the large number of predictors in our regression models, we had to balance runtime and memory requirements\. In pilot experiments on earlier formulations of models for the POPQUORN dataset, when comparing the exact same model formulation with cmdstanr, the backend we use throughout the reported analyses in the paper, instead of rstan, we see a 60% reduction in memory needs\. This, however, does not come without a drawback, as we see a 20% increase in runtime\.

Similarly, using cmdstanr may come with limitations with respect to how many predictors of which type \(ordinal, nominal, etc\.\) are used, and how they are represented in intermediate steps internally\. We ran into several errors caused by a size limit of $2^{31}-1$ bytes that traced back to internal intermediate transformations to JSON strings\. While this specific problem may pertain to the specific version of cmdstanr we are using, this points to the more general limitation that, given large enough data and complex enough model formulations, existing implementations may not be able to handle such analyses\.

## Appendix N Robustness Check: Testing different model formulations

To test robustness, we ran pilot experiments on POPQUORN with a Gaussian likelihood and an identity link function, and a cumulative likelihood and a probit link function, with three horseshoe prior settings: \(a\) the default horseshoe prior, \(b\) a horseshoe prior with the global shrinkage parameter set to half of the default, and \(c\) a horseshoe prior with the student\-t slab scale set to $10^{6}$\.

Table 9: Pairwise Pearson correlations between z\-scored estimates of different model configurations\. G refers to Gaussian likelihood with an identity link function, and P to the cumulative likelihood with a probit link function\. A refers to the default horseshoe prior, B to a horseshoe prior with a halved global shrinkage parameter, and C to a horseshoe prior with the student\-t slab scale set to $10^{6}$\.

We compare the models by calculating the pairwise Pearson correlation between their z\-scored estimates\. Table [9](https://arxiv.org/html/2605.06318#A14.T9) shows the results\. All of the combinations reach a Pearson correlation of over 0\.91, indicating stable results across model formulations\. While there is no guarantee that this holds for our other datasets, we assume this to be the case and, given the time and resource requirements of each of the runs, do not run such stability comparisons for the other datasets\.

## Appendix O Interactions

![Refer to caption](https://arxiv.org/html/2605.06318v1/x8.png)

Figure 14: age:n\_hateful\_all\_lexicons \(POPQUORN\)

![Refer to caption](https://arxiv.org/html/2605.06318v1/x9.png)

Figure 15: ideology:age \(MHS\)
