Learning Perspectivist Social Meaning via Demographic-Conditioned Fusion Embeddings

arXiv cs.CL Papers

Summary

This paper proposes demographic-conditioned fusion embeddings to model perspectivist social meaning in language, showing consistent improvements over text-only baselines by integrating annotator demographics into NLP systems.

arXiv:2606.07123v1 Announce Type: new Abstract: Social meaning in language is inherently perspectival, varying across annotator backgrounds, demographics, and ideological positions. However, most NLP systems collapse this variation into a single ground-truth label, ignoring the diversity of interpretations. In this work, we model social dimensions along a perspectivist spectrum, capturing how interpretations vary across demographic groups on a dataset consisting of 28k human annotations. We benchmark multiple modeling paradigms, including zero-shot, few-shot, and fine-tuned approaches, and propose fusion embeddings that integrate textual and demographic representations. Our fusion models yield consistent and statistically significant improvements over text-only baselines across all fusion strategies (+5.9-6.5% relative macro PR-AUC), with shuffle ablations confirming that demographic profiles carry genuine predictive signal rather than spurious correlations.

# Learning Perspectivist Social Meaning via Demographic-Conditioned Fusion Embeddings
Source: [https://arxiv.org/html/2606.07123](https://arxiv.org/html/2606.07123)
Amanda Cercas Curry (Independent Researcher, amanda.cercas@gmail.com), Lucio La Cava (University of Calabria, lucio.lacava@dimes.unical.it), Luca Maria Aiello (IT University of Copenhagen, luai@itu.dk), Gianmarco De Francisci Morales (CENTAI, gdfm@acm.org)

###### Abstract

Social meaning in language is inherently perspectival, varying across annotator backgrounds, demographics, and ideological positions. However, most NLP systems collapse this variation into a single ground-truth label, ignoring the diversity of interpretations. In this work, we model social dimensions along a *perspectivist spectrum*, capturing how interpretations vary across demographic groups on a dataset consisting of 28k human annotations. We benchmark multiple modeling paradigms, including zero-shot, few-shot, and fine-tuned approaches, and propose fusion embeddings that integrate textual and demographic representations. Our fusion models yield consistent and statistically significant improvements over text-only baselines across all fusion strategies (+5.9–6.5% relative macro PR-AUC), with shuffle ablations confirming that demographic profiles carry genuine predictive signal rather than spurious correlations.


## 1 Introduction

Social meaning is not fixed in text: it is constructed in the act of reading. The same utterance can convey knowledge-sharing to one reader, status-conferral to another, and support to a third, depending on who interprets it and through which cultural, demographic, and ideological lens. This perspectival nature of pragmatic meaning has long been acknowledged in sociolinguistics Plank ([2022](https://arxiv.org/html/2606.07123#bib.bib11)); Hovy and Yang ([2021](https://arxiv.org/html/2606.07123#bib.bib13)). Yet, most NLP systems treat social labels as objective properties of text by collapsing the diversity of human interpretation into a single ground truth.

Building on decades of social science research, Choi et al. ([2020](https://arxiv.org/html/2606.07123#bib.bib21)) operationalized ten fundamental dimensions of social interaction (knowledge, power, status, trust, support, romance, similarity, identity, fun, and conflict) and showed that they can be reliably detected from conversational text. Their framework has since been applied to study agreement, coordination, and community well-being in online settings Monti et al. ([2022](https://arxiv.org/html/2606.07123#bib.bib22)); Lucchini et al. ([2022](https://arxiv.org/html/2606.07123#bib.bib23)); Aiello et al. ([2021](https://arxiv.org/html/2606.07123#bib.bib24)); Balsamo et al. ([2023](https://arxiv.org/html/2606.07123#bib.bib25)). However, the original dataset was collected with a homogeneous annotator pool, released only in aggregated form, and provided no information on annotator demographics, making it impossible to study how interpretation varies across people.

P1SCO (Curry et al., [2026](https://arxiv.org/html/2606.07123#bib.bib33)) addresses this gap by introducing a large-scale, disaggregated dataset of social dimension annotations across three social media platforms (Reddit, YouTube, and Instagram), contributed by 543 demographically diverse participants from the US and UK. P1SCO demonstrates that gender, age, nationality, and political orientation, as well as Big Five personality traits, correlate with label assignment across all ten dimensions. Crucially, homogeneous demographic groups exhibit higher within-group agreement than the overall population, suggesting that shared social experiences produce convergent interpretive frameworks. These findings establish a key premise for the present work: annotator disagreement in this task is not noise but signal.

In this paper, we ask: can models learn to predict how different demographic groups perceive social dimensions? We frame this as a perspectivist prediction task, in which the model receives both a candidate text and an annotator demographic profile, and must estimate the probability that an annotator with that profile would assign each social dimension. Using P1SCO as our evaluation platform, we benchmark a range of modeling paradigms (zero-shot, few-shot, and fine-tuned) and propose demographic-conditioned fusion embeddings that integrate textual and demographic representations at multiple levels of depth.

Empirically, we find that demographic conditioning consistently improves over text-only baselines, with our best fusion model achieving a ${\sim}6.5\%$ relative gain in macro PR-AUC. A shuffle ablation confirms that these gains reflect genuinely informative demographic signals rather than spurious correlations. Per-label analysis reveals the largest improvements for semantically ambiguous dimensions, Power ($+51.9\%$ relative) and Trust ($+30.1\%$ relative), where demographic perspective most strongly mediates interpretation.

#### Contributions

- We establish the first comprehensive baselines for perspectivist social dimension prediction on P1SCO, benchmarking zero-shot, few-shot, and fine-tuned paradigms.
- We propose fusion embeddings combining textual and demographic representations at three integration depths, achieving up to $+6.5\%$ relative macro PR-AUC over text-only models, with all fusion strategies yielding statistically significant gains.
- Through shuffle ablations and per-label analysis, we show that demographic signals carry genuine predictive information and that the full gender–age–nationality triplet outperforms any demographic subset alone.

## 2 Task Formulation

Let $\mathcal{L}$ denote the ten-label set. For a candidate $c$ with text $x_c$, let $R_c$ denote the set of annotators who labelled it, and let $v^{(r)}_{c,\ell} \in \{0,1\}$ be the binary label for $c$ and dimension $\ell \in \mathcal{L}$ assigned by $r \in R_c$, indicating whether a social dimension is present.

For any non-empty annotator subset $S_c \subseteq R_c$, we define a *soft* classification target as the fraction of annotators in $S_c$ that recognized dimension $\ell$ in candidate text $c$:

$$s_{c,\ell}(S_c) = \frac{1}{|S_c|} \sum_{r \in S_c} v^{(r)}_{c,\ell}, \tag{1}$$

and the corresponding *hard* classification target as:

$$y_{c,\ell}(S_c) = \mathbb{I}\left[s_{c,\ell}(S_c) > 0.5\right]. \tag{2}$$

Specifically, $s_{c,\ell}(S_c)$ indicates the degree of agreement inside the annotator set $S_c$, whereas $y_{c,\ell}(S_c)$ indicates the corresponding binary majority version (with exact ties defaulting to 0).
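
As a minimal sketch of Eqs. (1) and (2), the snippet below computes the soft and hard targets for one candidate and one dimension. The function names and the vote lists are hypothetical illustrations and are not taken from the paper's code.

```python
def soft_target(votes):
    """Eq. (1): fraction of annotators in the subset who marked the dimension."""
    if not votes:
        raise ValueError("the annotator subset S_c must be non-empty")
    return sum(votes) / len(votes)


def hard_target(votes):
    """Eq. (2): 1 only under a strict majority, so an exact tie yields 0."""
    return int(soft_target(votes) > 0.5)


# Hypothetical binary votes for one candidate, one entry per annotator.
votes_power = [1, 0, 1, 1]  # three of four annotators see Power
votes_trust = [1, 0, 1, 0]  # exact tie

print(soft_target(votes_power), hard_target(votes_power))
print(soft_target(votes_trust), hard_target(votes_trust))
```

The strict inequality is what makes a tied vote fall back to 0, matching the tie rule stated above.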

Note that, hereinafter, the data splitting is always performed at the candidate level using iterative multilabel stratification, so as to avoid leaking candidate content between splits.

#### Majority Prediction

This task corresponds to the special case where $S_c = R_c$, i.e., all annotators are taken into account. For this task, we derive the majority labels $s_{c,\ell}^{\mathrm{maj}} = s_{c,\ell}(R_c)$ and $y_{c,\ell}^{\mathrm{maj}} = y_{c,\ell}(R_c)$ by only considering candidates having at least three annotations. The input is $x_c$, and the output yields one probability value per social dimension. In the hard-label setting, we use $y_{c,\ell}^{\mathrm{maj}}$ for evaluation, while in the soft-label setting we compare the model predictions with $s_{c,\ell}^{\mathrm{maj}}$.

The final split for this task contains 3879/550/1104 train/val/test candidates.

#### Perspectivist Prediction

This task setting defines the annotator subset $S_c$ based on demographic profiles. Let us denote with $\mathbf{m}_r = (g_r, a_r, n_r)$ the demographic profile for annotator $r$, consisting of $r$'s gender, age group, and nationality, and let $G = (g, a, n)$ define a demographic group as a tuple. For a candidate $c$, the group-specific annotator subset is:

$$R_c(G) = \{r \in R_c : \mathbf{m}_r = G\}. \tag{3}$$

If $R_c(G)$ is non-empty, we can use the definitions above to define the corresponding group-specific soft and hard labels as $s_{c,\ell}^{(G)} = s_{c,\ell}(R_c(G))$ and $y_{c,\ell}^{(G)} = y_{c,\ell}(R_c(G))$.

We emphasize that, instead of collapsing individual annotations to group-level ones, we retain one record per observed annotation, exploiting intra-group variability to better calibrate predictions. Specifically, the task input corresponds to a pair $(x_c, \mathbf{m}_r)$ and the expected output estimates $p(v^{(r)}_{c,\ell} = 1 \mid x_c, \mathbf{m}_r)$ for each label.

Note that for a given group $G$, this probability corresponds to an estimate of $s_{c,\ell}^{(G)}$, whereas the corresponding hard label can be obtained by thresholding these probabilities.
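
The group-specific subset in Eq. (3) and the resulting group labels can be sketched as follows. This is an illustrative reading of the definitions for a single candidate and dimension; the record layout, the function name `group_targets`, and the example profiles are assumptions made for the sketch.

```python
from collections import defaultdict


def group_targets(annotations):
    """Map each demographic tuple G to (soft, hard) labels over R_c(G).

    `annotations` holds (profile, vote) pairs for one candidate and one
    dimension, with profile = (gender, age_group, nationality).
    """
    by_group = defaultdict(list)
    for profile, vote in annotations:
        by_group[profile].append(vote)  # builds R_c(G) implicitly
    targets = {}
    for profile, votes in by_group.items():
        soft = sum(votes) / len(votes)
        targets[profile] = (soft, int(soft > 0.5))
    return targets


# Hypothetical annotations for one candidate and one dimension.
annotations = [
    (("Male", "18-24", "US"), 1),
    (("Male", "18-24", "US"), 1),
    (("Female", "35-44", "UK"), 0),
    (("Female", "35-44", "UK"), 1),
    (("Female", "35-44", "US"), 0),
]
targets = group_targets(annotations)
for profile, (soft, hard) in targets.items():
    print(profile, soft, hard)
```

Only groups with at least one annotation appear in the result, mirroring the non-emptiness condition on $R_c(G)$.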

The final split for this task contains 19,022/2691/5480 train/val/test candidates.

## 3 Architecture

### 3.1 Text Encoder

The main architecture we train to address the social dimension classification tasks consists of a fine-tuned RoBERTa-large Liu et al. ([2019](https://arxiv.org/html/2606.07123#bib.bib1)), implemented via the HuggingFace library. Let us denote with $E$ the encoder, with a set of parameters $\theta$. For any candidate input text $x_c$, we collect the final-layer, first-token representation $\mathbf{h}_c = E_\theta(x_c) \in \mathbb{R}^{H}$, i.e., the $[CLS]$ token, with $H = 1024$ for RoBERTa-large.

A small classification head maps this text representation to one logit per social dimension:

$$\mathbf{z}^{\mathrm{text}}_c = C_{\mathrm{text}}(\mathbf{h}_c) \in \mathbb{R}^{|\mathcal{L}|} \tag{4}$$

where $C_{\mathrm{text}}$ denotes the standard RoBERTa classification-head design. Majority-vote social-dimension prediction leverages these logits directly, whereas perspectivist prediction uses them as a text-only baseline.

### 3.2 Demographic Encoder

To encode a given socio-demographic perspective, we consider the annotator's gender ($g$), age group ($a$), and nationality ($n$) and encode them separately into 64-dimensional vectors. The encoding uses RoBERTa to transform the textual representation of the specific sociodemographic group into its embedding. Then, we project the concatenated representation as follows:

$$\mathbf{d}_r = D_{\mathrm{demo}}\left([\mathbf{e}^{g}(g_r); \mathbf{e}^{a}(a_r); \mathbf{e}^{n}(n_r)]\right) \in \mathbb{R}^{128}, \tag{5}$$

where $D_{\mathrm{demo}}$ is a two-layer MLP with GELU activation.

### 3.3 Text-Perspective Fusion Modalities

To integrate socio\-demographics within our latent representations and obtain the perspectivist social dimension classification, we devise different fusion strategies acting at three depths, plus a baseline\.

#### Text\-only baseline

This approach simply leverages the textual encoding $\mathbf{z}^{\mathrm{text}}$, ignoring $\mathbf{d}_r$. Note that, since we discard the sociodemographic conditioning, any improvement over this baseline suggests that sociodemographic integration contributes to prediction, beyond textual content.

#### Additive fusion

Predictions under this fusion modality are obtained as:

$$\mathbf{z}^{\mathrm{add}}_{c,r} = \mathbf{z}^{\mathrm{text}}_c + W_d \mathbf{d}_r + \mathbf{b}_d, \tag{6}$$

where $W_d$ and $\mathbf{b}_d$ are zero-initialized, so the model starts demographic-blind and learns group-conditioned residuals only where the training data support them. This represents the most conservative fusion strategy, as text remains the primary evidence, while demographics can only adjust the label logits.
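
To make the zero-initialization property of Eq. (6) concrete, here is a small plain-Python sketch (no deep-learning framework, so it is not the authors' implementation). The dimensions 10 and 128 follow the text; the logit values, the demographic vector, and the single "learned" weight are hypothetical.

```python
def additive_fusion(z_text, d, W, b):
    """Eq. (6): z_add = z_text + W d + b, computed label by label."""
    return [
        z + sum(w * x for w, x in zip(row, d)) + bias
        for z, row, bias in zip(z_text, W, b)
    ]


NUM_LABELS, DEMO_DIM = 10, 128

# Hypothetical text logits and demographic vector.
z_text = [float(i) - 5.0 for i in range(NUM_LABELS)]
d = [1.0 if j == 0 else 0.0 for j in range(DEMO_DIM)]

# Zero-initialized residual: the fused logits equal the text-only logits.
W0 = [[0.0] * DEMO_DIM for _ in range(NUM_LABELS)]
b0 = [0.0] * NUM_LABELS
z_init = additive_fusion(z_text, d, W0, b0)

# After (hypothetical) training, one weight is non-zero and shifts one logit.
W1 = [row[:] for row in W0]
W1[1][0] = 2.0
z_trained = additive_fusion(z_text, d, W1, b0)
print(z_init[1], z_trained[1])
```

Because the residual starts at exactly zero, any deviation from the text-only logits has to be learned from the data, which is the conservative behaviour described above.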

#### Early fusion

This modality concatenates the pooled textual representation and socio\-demographic encodings before classification:

$$\mathbf{z}^{\mathrm{early}}_{c,r} = C_{\mathrm{early}}([\mathbf{h}_c; \mathbf{d}_r]), \tag{7}$$

where $C_{\mathrm{early}}$ follows the classification head of $C_{\mathrm{text}}$ but receives as input both the text vector and the demographic vector. Note that this fusion strategy can learn deeper, non-additive, text-demographic interactions.

#### Concat\-then\-encode fusion

This represents the deepest integration strategy, as it prepends the demographic tuple to the text before tokenization and encoding, such as:

$$\texttt{[Female, 25-34, UK] } \langle\text{text}\rangle.$$

After concatenation, text is processed through the text-only encoder; thus this modality does not use the encoder described in [Section 3.2](https://arxiv.org/html/2606.07123#S3.SS2). Here, the deepest integration stems from the possibility for every Transformer layer to attend to the "conditioning" prefix while encoding the text.
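
The prefix construction can be sketched with a small helper. The bracketed format follows the example above; the function name and the exact separator (a single space after the closing bracket) are assumptions, since the text does not spell them out.

```python
def with_demographic_prefix(text, gender, age_group, nationality):
    """Prepend the demographic tuple to the raw text before tokenization."""
    return f"[{gender}, {age_group}, {nationality}] {text}"


# Hypothetical candidate text.
example = with_demographic_prefix("Thanks for explaining this!", "Female", "25-34", "UK")
print(example)
```

The resulting string would then be passed to the same tokenizer and encoder used for the text-only baseline, so no separate demographic encoder is involved.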

### 3.4 Training

Given the prediction settings in [Section 2](https://arxiv.org/html/2606.07123#S2) and the architectures above, training differs only in which target vector is paired with each model output. All models produce one logit and one probability per social dimension. We write $\mathbf{z}_c \in \mathbb{R}^{|\mathcal{L}|}$ for candidate-level logits and $\mathbf{p}_c = \sigma(\mathbf{z}_c) \in [0,1]^{|\mathcal{L}|}$ for the corresponding probabilities. Thus, $p_{c,\ell}$ is the model probability that social dimension $\ell$ applies to candidate $c$. For models that work on disaggregated single-annotation labels, we analogously write $\mathbf{z}_c^{(r)}$ and $\mathbf{p}_c^{(r)} = \sigma(\mathbf{z}_c^{(r)})$, where $p_{c,\ell}^{(r)}$ estimates the probability that an annotator with profile $\mathbf{m}_r$ assigns label $\ell$ to candidate $c$.

#### Majority Social Dimension Prediction

For the candidate-level model, the input is the text $x_c$. We train three variants with the same probability output $\mathbf{p}_c$: (i) a hard-label variant targeting $\mathbf{y}_c^{\mathrm{maj}}$, (ii) a soft-label variant targeting $\mathbf{s}_c^{\mathrm{maj}}$, and (iii) a variation of the latter trained with an MSE/Brier objective to directly optimize probability errors.

#### Perspectivist Social Dimension Prediction

For the annotation-level prediction, the input is $(x_c, \mathbf{m}_r)$ and the model outputs $\mathbf{p}_c^{(r)} \in [0,1]^{|\mathcal{L}|}$, where each component estimates $P(v_{c,\ell}^{(r)} = 1 \mid x_c, \mathbf{m}_r)$.

As mentioned in [Section 2](https://arxiv.org/html/2606.07123#S2), during training, we use the observed annotation vector $\mathbf{v}_c^{(r)}$ derived from individual annotations $v_{c,\ell}^{(r)}$ as the target. Conversely, at inference, fixing $\mathbf{m}_r = G$ gives a group-conditioned prediction $\hat{\mathbf{s}}^{(G)}_c = \mathbf{p}^{(G)}_c$, which serves as an estimate of the expected soft perspective of group $G$ on text $c$. Note that the corresponding hard prediction is $\hat{y}^{(G)}_{c,\ell} = \mathbb{I}[\hat{s}^{(G)}_{c,\ell} > 0.5]$. Finally, our text-only baseline uses the same annotation-level target but omits $\mathbf{m}_r$, so it cannot produce group-specific predictions.

#### Objective Functions

Let $\mathrm{BCE}_w$ denote multilabel binary cross-entropy with positive-label weighting, which accounts for rare labels (we similarly account for underrepresented demographics via sampling). The aforementioned training objectives can be summarized as follows:

$$
\begin{aligned}
\mathcal{J}_{\mathrm{hard}} &= \mathrm{BCE}_w(\mathbf{y}^{\mathrm{maj}}_c, \mathbf{p}_c), && (8)\\
\mathcal{J}_{\mathrm{soft}} &= \mathrm{BCE}_w(\mathbf{s}^{\mathrm{maj}}_c, \mathbf{p}_c), && (9)\\
\mathcal{J}_{\mathrm{mse}} &= \frac{1}{|\mathcal{L}|} \sum_{\ell \in \mathcal{L}} \left(p_{c,\ell} - s^{\mathrm{maj}}_{c,\ell}\right)^{2}, && (10)\\
\mathcal{J}_{\mathrm{persp}} &= \mathrm{BCE}_w(\mathbf{v}_c^{(r)}, \mathbf{p}_c^{(r)}). && (11)
\end{aligned}
$$

Note that candidate-level models differ by whether they learn from hard consensus labels or soft annotation fractions, while annotation-level models learn from individual annotator votes, and their probabilities are interpreted as estimates of the expected group-level soft perspective when a demographic profile is provided.
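
The objectives in Eqs. (8) to (11) can be illustrated for a single example as below. The exact form of the positive-label weighting is not given in the text; this sketch assumes the common convention in which a per-label weight multiplies only the positive term, and all target values, probabilities, and weights are hypothetical.

```python
import math


def bce_w(targets, probs, pos_weight):
    """Weighted multilabel BCE, averaged over labels.

    Targets may be hard (0/1) or soft (fractions in [0, 1]); the weight
    scales only the positive term (an assumed convention).
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for t, p, w in zip(targets, probs, pos_weight):
        p = min(max(p, eps), 1.0 - eps)
        total += -(w * t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(targets)


def mse(targets, probs):
    """Eq. (10): mean squared error between probabilities and soft targets."""
    return sum((p - t) ** 2 for t, p in zip(targets, probs)) / len(targets)


# Hypothetical two-label example.
probs = [0.5, 0.2]
y_maj = [1.0, 0.0]    # hard majority targets, Eq. (8)
s_maj = [0.75, 0.25]  # soft annotation fractions, Eqs. (9)-(10)
weights = [2.0, 1.0]  # up-weights positives of a rarer first label

print(bce_w(y_maj, probs, weights))
print(bce_w(s_maj, probs, weights))
print(mse(s_maj, probs))
```

The perspectivist objective in Eq. (11) reuses `bce_w` unchanged, with the individual vote vector as the target and the profile-conditioned probabilities as the prediction.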

## 4 Experimental Setup

We evaluate on P1SCO, a dataset with social labels and annotator demographic information (Curry et al., [2026](https://arxiv.org/html/2606.07123#bib.bib33)).

### 4.1 Evaluation

#### Majority Setting

Each test sample is a candidate $c$. The model outputs the probability vector $\mathbf{p}_c$ over the social dimensions, and we evaluate these probabilities against two candidate-level targets: the hard majority vector $\mathbf{y}^{\mathrm{maj}}_c$ and the soft annotation-fraction vector $\mathbf{s}^{\mathrm{maj}}_c$.

#### Perspectivist Setting

Each test sample is a candidate-annotator pair $(c, r)$. The model receives the text $x_c$ and the annotator profile $\mathbf{m}_r$, and outputs $\mathbf{p}_c^{(r)}$. This vector estimates how an annotator with profile $\mathbf{m}_r$ would label $x_c$, and evaluation compares it with the observed individual label vector $\mathbf{v}_c^{(r)}$. A group-level perspective can be obtained at inference time by fixing a profile $G$ and reading $\mathbf{p}^{(G)}_c$ as the expected soft perspective of that group.

#### Metrics

We use macro PR-AUC as the primary metric under both settings, computed against the hard targets, i.e., $\mathbf{y}^{\mathrm{maj}}_c$ for the majority setting, and $\mathbf{v}_c^{(r)}$ for the perspectivist one. This choice gives equal weight to each social dimension, is threshold-free, and is particularly suitable for sparse, multilabel data. For the majority setting, we also report the Brier error against soft targets, where lower values indicate that model probabilities better match the observed annotator agreement.
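
The two metrics can be sketched in plain Python as follows. This is a hedged illustration and not the paper's evaluation code: it assumes PR-AUC is estimated by average precision, that scores contain no ties across samples with different labels, and that the Brier error is averaged over all candidate-label cells. The toy labels and probabilities are hypothetical.

```python
def average_precision(y_true, scores):
    """Average precision for one label: mean precision at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    if n_pos == 0:
        return float("nan")  # undefined without positives
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def macro_pr_auc(Y, P):
    """Unweighted mean of per-label average precision (rows = samples)."""
    n_labels = len(Y[0])
    per_label = [
        average_precision([row[l] for row in Y], [row[l] for row in P])
        for l in range(n_labels)
    ]
    return sum(per_label) / n_labels


def brier(S, P):
    """Mean squared gap between probabilities and soft targets."""
    cells = [(p - s) ** 2 for s_row, p_row in zip(S, P) for s, p in zip(s_row, p_row)]
    return sum(cells) / len(cells)


# Hypothetical hard labels and probabilities: four samples, two labels.
Y = [[1, 0], [0, 1], [1, 0], [0, 1]]
P = [[0.9, 0.1], [0.8, 0.9], [0.7, 0.2], [0.1, 0.8]]
print(macro_pr_auc(Y, P))
print(brier([[0.5, 1.0]], [[0.5, 0.5]]))
```

Averaging per-label scores without weighting is what gives a rare dimension the same influence on the headline number as a frequent one.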

To compare the various perspectivist encodings, we also report PR-AUC confidence intervals over 1000-sample bootstraps over candidates.
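
A candidate-level bootstrap of the difference between two models could look like the sketch below. It assumes a percentile interval and a single-label average-precision metric for brevity; the paper does not specify these details, and the records in `by_candidate` are hypothetical.

```python
import random


def average_precision(y_true, scores):
    """Average precision for one label (assumes at least one positive)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / sum(y_true)


def bootstrap_delta_ci(by_candidate, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for (fused - baseline), resampling whole candidates.

    `by_candidate` maps a candidate id to its (label, p_base, p_fused) rows,
    so all annotations of a resampled candidate move together.
    """
    rng = random.Random(seed)
    ids = sorted(by_candidate)
    deltas = []
    while len(deltas) < n_boot:
        sample = [rng.choice(ids) for _ in ids]
        rows = [row for cid in sample for row in by_candidate[cid]]
        labels = [r[0] for r in rows]
        if sum(labels) == 0:
            continue  # metric undefined without positives; redraw
        base = average_precision(labels, [r[1] for r in rows])
        fused = average_precision(labels, [r[2] for r in rows])
        deltas.append(fused - base)
    deltas.sort()
    lower = deltas[int(alpha / 2 * n_boot)]
    upper = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-candidate records: (label, baseline prob, fused prob).
by_candidate = {
    "c1": [(1, 0.2, 0.9), (0, 0.6, 0.1)],
    "c2": [(1, 0.7, 0.8), (0, 0.3, 0.2)],
    "c3": [(0, 0.5, 0.3), (1, 0.4, 0.7)],
    "c4": [(0, 0.1, 0.05), (1, 0.8, 0.95)],
}
lower, upper = bootstrap_delta_ci(by_candidate)
print(lower, upper)
```

Resampling candidates rather than individual annotations keeps the several annotations of one text together, which respects the candidate-level structure of the data.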

## 5 Results

Table 1: Majority-task results. PR-AUC is macro PR-AUC against majority labels (mean $\pm$ std over three seeds). "Fractions" denotes annotation-fraction supervision, and Brier evaluates probability fit to annotation fractions; lower is better.

Table 2: Perspectivist-task macro PR-AUC (mean $\pm$ std over three seeds). $\Delta$ and CIs are absolute PR-AUC deltas in percentage points; relative gains are shown in parentheses. $\dagger$ = 95% CI excludes zero.

Table 3: Per-label PR-AUC (mean), sorted by absolute additive-minus-text gain. Relative gains are shown in parentheses. Italic row = negative gain.

Table 4: Shuffle ablation averaged over three seeds. "Normal" and "Shuffled" report macro PR-AUC with the full G+A+N triplet. $\Delta$ is the absolute shuffled-minus-normal difference, with relative drop in parentheses.

Table 5: Additive model macro PR-AUC by demographic subset (mean over 3 seeds). G=gender, A=age, N=nationality.

Table 6: Most informative demographic subset per label (additive, mean over 3 seeds). 9/10 labels: full triplet is best. Italic: romance uniquely benefits from gender alone (+1.5pp over full triplet).

[Table 1](https://arxiv.org/html/2606.07123#S5.T1) reports Majority Prediction results across the three supervision objectives. Replacing hard majority-vote labels with soft annotation fractions (Soft BCE) yields a $+3.3$pp improvement in macro PR-AUC ($0.438 \pm 0.009$ vs. $0.405 \pm 0.011$). This result confirms that annotator disagreement carries predictive signal beyond what majority-vote labels encode. The MSE/Brier objective matches Soft BCE on PR-AUC ($0.438 \pm 0.008$) while achieving a substantially lower Brier score ($0.0295$ vs. $0.0402$), reflecting its direct optimisation of probability calibration. Hard BCE produces the worst calibration of the three ($\text{Brier} = 0.0455$), thus confirming that soft targets improve probability estimates regardless of the specific loss used. Overall, soft supervision is strictly preferable to hard-label training: annotation fractions improve discrimination, and the MSE/Brier objective further improves calibration at no cost to PR-AUC.

### 5.1 Effect of Demographic Conditioning

[Table 2](https://arxiv.org/html/2606.07123#S5.T2) reports Perspectivist Prediction macro PR-AUC for the text-only baseline and the three fusion strategies. All three fusion strategies significantly outperform the text-only baseline: the additive model gains $+2.53$pp ($+6.5\%$), early fusion $+2.34$pp ($+6.0\%$), and concat-then-encode $+2.30$pp ($+5.9\%$). Demographic conditioning therefore provides a consistent and reliable improvement regardless of how deeply the demographic signal is integrated into the model. Contrary to our initial expectations, deeper fusion does *not* yield larger gains. Instead, shallower integration depth is more beneficial (additive $\geq$ early fusion $\geq$ concat-then-encode), though pairwise differences are non-significant (early fusion vs. additive: $\Delta = -0.19$pp, $[-0.7, +0.3]$; concat-then-encode vs. additive: $\Delta = -0.23$pp, $[-0.8, +0.3]$). We speculate this effect might be attributed to data sparsity: the per-group annotator counts in P1SCO are insufficient to exploit the additional capacity of deeper integration, while the zero-initialised additive residuals provide a strong inductive bias that prevents demographic corrections from overriding the text signal. See Appendix [C](https://arxiv.org/html/2606.07123#A3) for detailed deltas per group.
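
The absolute (percentage-point) and relative gains quoted above are linked by relative = absolute / baseline. The short check below back-solves the baseline implied by each reported pair; the implied value is an inference from rounded figures, not a number quoted in this passage, and serves only as a consistency check.

```python
# Reported (absolute gain in pp, relative gain in %) per fusion strategy.
reported = {
    "additive": (2.53, 6.5),
    "early fusion": (2.34, 6.0),
    "concat-then-encode": (2.30, 5.9),
}


def implied_baseline(abs_pp, rel_pct):
    """Baseline macro PR-AUC (in points) implied by relative = absolute / baseline."""
    return abs_pp / (rel_pct / 100.0)


implied = {name: implied_baseline(*pair) for name, pair in reported.items()}
for name, value in implied.items():
    print(f"{name}: implied text-only baseline of about {value:.1f} points")
```

All three pairs point to roughly the same text-only baseline, which suggests the absolute and relative figures are mutually consistent up to rounding.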

### 5.2 Prompting vs Fine-Tuning

We compare our proposed architecture to model performance in zero- and few-shot settings using a mixture of open and closed-weight models of different sizes. We also include abliterated versions of the models. We evaluate all models on the 1104-example test split of P1SCO under two ground truth conditions: a strict majority label ($y$) and a lenient any-annotator label ($s > 0$). Prompts can be found in Appendix [A](https://arxiv.org/html/2606.07123#A1). [Table 7](https://arxiv.org/html/2606.07123#S5.T7) shows detailed results per model and metric. All models are well above the stratified random baseline ($\text{macro F1} = 0.10$) under the majority condition, with the exception of DeepSeek-R1-7B in zero-shot ($0.06$). GPT-4o-mini zero-shot achieves the highest macro F1 under the majority condition ($0.36$), followed closely by Llama-3.1-8B-abliterated few-shot ($0.34$) and DeepSeek-R1-14B few-shot ($0.33$). Few-shot prompting consistently outperforms zero-shot across models, and allowing the model to abstain with a none label generally yields better results than forcing a prediction. Abliterated models perform comparably to their non-abliterated counterparts, with Llama-3.1-8B-abliterated marginally outperforming the base Llama-3.1-8B in both conditions. Under the any-annotator condition, rankings shift considerably: Mistral-7B-v0.3 emerges as the strongest model ($0.47$), suggesting it produces broader label sets that align well with minority annotator judgements. Across conditions, our approach outperforms prompting methods.
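
The two ground-truth conditions can be illustrated with a small macro F1 computation. The sketch assumes that labels with no true or predicted positives score 0 and uses hypothetical annotation fractions and predictions; it is not the paper's evaluation script.

```python
def ground_truth(S, mode):
    """Binarize soft fractions: 'majority' uses s > 0.5, 'any' uses s > 0."""
    threshold = 0.5 if mode == "majority" else 0.0
    return [[int(s > threshold) for s in row] for row in S]


def f1_binary(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed zero-division convention


def macro_f1(Y, Y_hat):
    n_labels = len(Y[0])
    scores = [
        f1_binary([row[l] for row in Y], [row[l] for row in Y_hat])
        for l in range(n_labels)
    ]
    return sum(scores) / n_labels


# Hypothetical annotation fractions (three candidates, two labels) and
# predictions from a model that tends to output broad label sets.
S = [[0.75, 0.25], [0.0, 0.5], [1.0, 0.0]]
preds = [[1, 1], [0, 1], [1, 0]]

f1_majority = macro_f1(ground_truth(S, "majority"), preds)
f1_any = macro_f1(ground_truth(S, "any"), preds)
print(f1_majority, f1_any)
```

In this toy case the same predictions score higher under the any-annotator condition, which is the kind of ranking shift described above for models that produce broader label sets.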

Table 7: Benchmark results on the test set ($N = 1104$). Majority uses the hard majority label; Perspectivist treats a label as positive if any annotator assigned it. For each model, the best result across prompt conditions (none/no-none) is reported. Abliterated models are marked -abl. Bold: best within group; $\dagger$: best overall. See Appendix [B](https://arxiv.org/html/2606.07123#A2) for more.

## 6 Analysis

### 6.1 Fusion Mechanism Insights

[Table 2](https://arxiv.org/html/2606.07123#S5.T2) and seed-level bootstrap confidence intervals confirm that all three fusion strategies significantly outperform the text-only perspectivist baseline, while pairwise differences between fusion strategies remain non-significant (early fusion vs. additive: $\Delta = -0.19$pp, $[-0.7, +0.3]$; concat-then-encode vs. additive: $\Delta = -0.23$pp, $[-0.8, +0.3]$). The choice of fusion depth is therefore not the critical factor; what matters is that demographic information is included at all.

The additive model’s competitive performance despite its architectural simplicity is informative\. Its zero\-initialized weights allow the model to begin training as a text\-only classifier and acquire demographic corrections only where annotation data supports them, acting as an implicit regularizer well\-suited to the sparse per\-group sample counts in P1SCO\. This feature might explain why deeper fusion does not translate into better generalisation\.

At the label level, fusion gains cluster around the most semantically underspecified dimensions. Power and Trust show the largest absolute improvements ([Table 3](https://arxiv.org/html/2606.07123#S5.T3)), consistent with the intuition that these dimensions depend strongly on relational context and social positioning that demographic background helps calibrate. Knowledge, Support, and Conflict, more directly recoverable from lexical content, see more modest gains. Romance is the sole exception: gender alone outperforms the full gender–age–nationality triplet by +1.5pp ([Table 6](https://arxiv.org/html/2606.07123#S5.T6)), the only dimension for which adding age and nationality degrades performance. We attribute this to the extreme sparsity of romance labels ($\approx 3\%$ prevalence), where nationality and age introduce statistical noise and sparsity that dilutes the gender signal. Taken together, the per-label pattern shows that demographic conditioning is most valuable precisely where interpretation is most contested, thus validating the perspectivist framing of the task.

### 6.2 Perspectivist Behaviour

To examine whether demographic conditioning produces genuinely perspectival predictions rather than a uniform shift in label probabilities, we decompose macro PR-AUC gains by demographic subgroup. For each annotator attribute (gender, age, nationality), we compare the text-only baseline against the additive model averaged across three seeds. For gender, male annotators benefit considerably more from conditioning than female annotators (+4.6 pp vs. +0.7 pp). Similarly for nationality, US annotators gain +4.1 pp against the text-only baseline, while UK annotators gain only +0.6 pp. The most pronounced effect is for age: the 18–24 group sees an average gain of +21.8 pp (from 42.4 to 64.1 macro PR-AUC), far exceeding any other age band, which all fall below +1.1 pp. These patterns suggest that the text-only model already captures the perspectives of certain groups reasonably well, while demographic conditioning primarily helps groups whose readings diverge from that default.

For example, for the comment "That's funny because I've never seen the Colonel so closed minded" the text-only model assigns a uniformly low probability of Power ($p = 0.057$) and Trust ($p = 0.058$) to all annotators, unable to differentiate between them. The additive model, in turn, produces divergent predictions: a Male, 18–24, US annotator receives $p_{\mathrm{power}} = 0.985$ and $p_{\mathrm{trust}} = 0.953$ while Female annotators aged 35–44 from both the UK and the US, and a Male 45–54 UK annotator, receive $p_{\mathrm{power}} \approx 0.15$ and $p_{\mathrm{trust}} \approx 0.09$, consistent with their labels of 0. The comment's implicit reference to authority and challenge appears to activate a power reading specifically for the younger US male, a perspectival response the model has learned to reproduce.

#### Comparison with LLMs\.

To contextualise these findings, we compare our models against four strong LLMs \(GPT\-4o\-mini, Mistral\-7B, Llama\-3\.1\-8B, DeepSeek\-R1\-14B; all few\-shot\) on per\-group macro F1 agreement with each demographic group’s majority labels\.

Our additive fusion model achieves the highest raw F1 for every demographic group, confirming that explicit demographic conditioning is strictly more perspectivist than prompt-based elicitation. This gap is widest for the 18–24 cohort, where fusion reaches $0.542$ vs. $0.368$ of the next-best model (Mistral-7B). We find that LLMs exhibit their own implicit demographic alignments. GPT-4o-mini is systematically most misaligned with 18–24 annotators ($\Delta = -0.048$) and with 60+ annotators ($\Delta = -0.026$); Mistral-7B shows the opposite tendency, over-representing 18–24 perspectives ($\Delta = +0.036$). These skews are presumably inherited from pre-training and RLHF data distributions and cannot be corrected without retraining. Our demographic-conditioned model, by contrast, makes its group-level assumptions explicit, auditable, and operationally distinct from uninstructed text processing.

In summary, the label- and model-level analyses suggest that the fusion model’s differential gains represent *intended perspectivism*, where the model has learned to predict differently for groups that perceive social dimensions differently.

## 7 Discussion

Our results confirm that demographic profiles carry genuine predictive signal: shuffle ablations show a 7.2–7.3% relative drop when demographic features are randomised ([Table 4](https://arxiv.org/html/2606.07123#S5.T4)), and the full gender–age–nationality triplet consistently outperforms any of its subsets ([Table 5](https://arxiv.org/html/2606.07123#S5.T5)). However, this group-level signal is necessarily an approximation of individual interpretive behaviour. Within any demographic group, substantial variation remains, a finding directly visible in the P1SCO annotation distributions. Orlikowski et al. ([2023](https://arxiv.org/html/2606.07123#bib.bib29)) warn against the ecological fallacy in annotation: inferring individual-level behaviour from group-level statistics. Our framework is susceptible to exactly this fallacy when predictions are read as individual rather than population-level estimates. We recommend that downstream applications treat model outputs as soft, distributional signals over demographic groups rather than as predictions about any particular annotator.
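The logic of a shuffle ablation can be sketched in a few lines. This is a minimal illustration rather than the authors' procedure: it assumes each annotation row carries a `profile` field, that profiles are permuted across rows while texts and labels stay fixed, and that the relative drop is measured against the unshuffled score. Whether shuffling is applied at training time, evaluation time, or both is not specified in this section, and the field and function names are hypothetical.

```python
import random


def shuffle_profiles(rows, seed=0):
    """Return copies of rows with demographic profiles permuted across rows.

    Texts and labels are untouched, so any text-only signal is preserved while
    the link between a profile and its annotation is broken.
    """
    rng = random.Random(seed)
    profiles = [row["profile"] for row in rows]
    rng.shuffle(profiles)
    return [{**row, "profile": profile} for row, profile in zip(rows, profiles)]


def relative_drop(score_true, score_shuffled):
    """Relative performance drop when true profiles are replaced by shuffled ones."""
    return (score_true - score_shuffled) / score_true
```

If the conditioned model only exploited spurious regularities unrelated to who annotated what, scores under true and shuffled profiles would be similar and `relative_drop` would be close to zero.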

Demographic conditioning improves aggregate performance, but this comes with an inherent tension: a model that predicts differently for a 25–34-year-old woman from the UK than for a 55–64-year-old man from the US is making assumptions about individuals based on group membership and risks stereotyping. We argue the distinction lies in intended use: demographic conditioning is appropriate when the goal is to describe the distribution of interpretations across a population, e.g., to understand how a piece of content will land across different communities. However, it becomes problematic if it is used to make inferences about, or decisions affecting, specific individuals. Our models should therefore be understood as tools for aggregate perspective modelling, not individual prediction.

While our findings show that demographic conditioning provides consistent and statistically significant improvements over text-only baselines, the absolute gains of approximately 2.5 percentage points appear small in isolation. We note, however, that these gains are uniform across all three fusion strategies and survive shuffle ablations, indicating they reflect genuine demographic signal rather than noise. Whether gains of this magnitude justify the additional complexity of demographic data collection and model conditioning in real-world deployments will depend on the application: for high-stakes settings where interpretive diversity matters, such as content moderation or public health communication, even small systematic improvements may be worthwhile, whereas for lower-stakes applications a text-only model may be sufficient.
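The relation between the absolute and relative figures can be written out explicitly. The symbols $s_{\text{text}}$ and $s_{\text{fusion}}$ (macro PR-AUC of the text-only and fusion models) are notation introduced here for illustration. The implied baseline range is only a back-of-envelope consistency check based on the rounded numbers quoted in the text, not a reported result; the exact values are those in the paper's results tables.

```latex
\[
\Delta_{\text{abs}} = s_{\text{fusion}} - s_{\text{text}}, \qquad
\Delta_{\text{rel}} = \frac{\Delta_{\text{abs}}}{s_{\text{text}}}
\quad\Longrightarrow\quad
s_{\text{text}} = \frac{\Delta_{\text{abs}}}{\Delta_{\text{rel}}}
\approx \frac{0.025}{0.059\text{--}0.065}
\approx 0.38\text{--}0.42 .
\]
```

A gain of about 2.5 percentage points therefore corresponds to a 5.9–6.5% relative improvement only because the text-only baseline itself is well below 0.5, which is typical of macro-averaged PR-AUC over sparse labels.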

## 8 Related Work

Choi et al. ([2020](https://arxiv.org/html/2606.07123#bib.bib21)) introduced the ten-dimensional framework of social meaning we build on, demonstrating that dimensions such as knowledge, power, and support can be reliably detected from conversational text. Subsequent work has applied this framework to study opinion dynamics on social media (Monti et al., [2022](https://arxiv.org/html/2606.07123#bib.bib22)), community well-being (Lucchini et al., [2022](https://arxiv.org/html/2606.07123#bib.bib23); Aiello et al., [2021](https://arxiv.org/html/2606.07123#bib.bib24)), and peer support in health contexts (Balsamo et al., [2023](https://arxiv.org/html/2606.07123#bib.bib25)). A key limitation of this line of work is its reliance on aggregated, majority-vote labels from a homogeneous annotator pool, which obscures interpersonal variation. A growing body of work challenges the assumption that annotation disagreement is noise to be resolved, argues that human label variation is a fundamental property of language data, and calls for models that embrace rather than flatten it (e.g. Plank, [2022](https://arxiv.org/html/2606.07123#bib.bib11); Basile et al., [2021](https://arxiv.org/html/2606.07123#bib.bib19); Uma et al., [2021](https://arxiv.org/html/2606.07123#bib.bib20); Ovesdotter Alm, [2011](https://arxiv.org/html/2606.07123#bib.bib18)). Motivated by this work, Curry et al. ([2026](https://arxiv.org/html/2606.07123#bib.bib33)) provided disaggregated annotations of the ten social dimensions on social media comments.

Several studies have demonstrated that annotator background shapes subjective judgments in tasks such as hate speech detection, sentiment analysis, and politeness classification (Waseem, [2016](https://arxiv.org/html/2606.07123#bib.bib26); Sap et al., [2022](https://arxiv.org/html/2606.07123#bib.bib27); Goyal et al., [2022](https://arxiv.org/html/2606.07123#bib.bib32)). In response, recent modelling work has explored incorporating annotator identity, including gender, age, and political orientation, as auxiliary signals, either through multi-task objectives or by conditioning classification heads on demographic embeddings (Mostafazadeh Davani et al., [2022](https://arxiv.org/html/2606.07123#bib.bib28); Orlikowski et al., [2023](https://arxiv.org/html/2606.07123#bib.bib29); Pei and Jurgens, [2023](https://arxiv.org/html/2606.07123#bib.bib34)). Our fusion embedding framework extends this line of work by systematically comparing integration depth, from late additive adjustment to early concatenation before encoding, and by providing controlled shuffle ablations that isolate the genuine predictive contribution of demographic signals.

## 9 Conclusion

We introduced a perspectivist framework for social dimension prediction, establishing the first comprehensive benchmarks on P1SCO across zero-shot, few-shot, and fine-tuned paradigms. We show that annotator demographic profiles carry genuine, non-spurious predictive signal: demographic-conditioned fusion embeddings yield consistent and statistically significant improvements over text-only baselines across all three fusion strategies (+5.9–6.5% relative macro PR-AUC), confirmed by shuffle ablations that return performance to the text-only range when demographic assignments are broken. Contrary to our expectations, shallower fusion is as effective as deeper integration, a result we attribute to the sparsity of per-group annotations in P1SCO and the strong inductive bias of additive zero-initialised residuals. Per-label analysis reveals that gains concentrate on the most semantically underspecified dimensions, Power and Trust, where shared demographic experience most strongly mediates interpretation, while Romance uniquely benefits from gender alone. Subgroup analysis further shows that conditioning disproportionately benefits groups whose perspectives diverge from the text-only default.

Future work should pursue richer and intersectional demographic representations, cross-dataset validation, and methods that move beyond group-level stereotyping towards more individual-aware perspective modelling.

## Limitations

While our results demonstrate the value of demographic conditioning for perspectivist social dimension prediction, several limitations bound the scope of our conclusions:

#### Annotation sparseness

Per-group annotator counts in P1SCO are uneven, with some demographic combinations represented by very few participants. We attribute the failure of deeper fusion strategies to outperform additive fusion partly to this sparsity: the additional model capacity offered by early and concat-then-encode fusion cannot be fully exploited when group-level signal is thin. Richer modelling of intersectional identities in particular will require datasets with substantially larger per-group annotator counts.

#### Limited demographic variables studied

We restricted ourselves to three demographic variables (gender, age, nationality). Personality traits, political orientation, and education exist in the dataset but are not currently modelled. We also limit our study to binary gender. Moreover, the dataset only includes data in English and participants from the UK and US, so our results may not generalise across languages and cultures. Extending perspectivist social dimension modelling to multilingual and cross-cultural settings remains an important direction for future work.

#### Evaluation setup

We test our proposed architecture on only one dataset. There is no cross-dataset validation, so it is unknown whether the demographic conditioning gains generalise to other social dimension datasets or annotation setups.

#### Practical significance

PR-AUC gains are modest in absolute terms, although results are consistent across fusion strategies and robust to shuffle ablations.

## Acknowledgments

AI-based assistants were used to support code development and provide writing and editing assistance during manuscript preparation. All content, analyses, and conclusions were reviewed and verified by the authors.

## References

- Aiello et al. (2021) How epidemic psychology works on Twitter: evolution of responses to the COVID-19 pandemic in the US. Vol. 8. External Links: [Document](https://dx.doi.org/10.1057/s41599-021-00861-3) Cited by: [§1](https://arxiv.org/html/2606.07123#S1.p2.1), [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- D. Balsamo, P. Bajardi, G. De Francisci Morales, C. Monti, and R. Schifanella (2023) The pursuit of peer support for opioid use recovery on Reddit. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 17, pp. 12–23. Cited by: [§1](https://arxiv.org/html/2606.07123#S1.p2.1), [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- V. Basile, M. Fell, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, and A. Uma (2021) We need to consider disagreement in evaluation. In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, Online, pp. 15–21. External Links: [Link](https://aclanthology.org/2021.bppf-1.3), [Document](https://dx.doi.org/10.18653/v1/2021.bppf-1.3) Cited by: [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- M. Choi, L. M. Aiello, K. Z. Varga, and D. Quercia (2020) Ten social dimensions of conversations and relationships. In Proceedings of The Web Conference 2020, Taipei, Taiwan, pp. 1514–1525. External Links: [Document](https://dx.doi.org/10.1145/3366423.3380224) Cited by: [§1](https://arxiv.org/html/2606.07123#S1.p2.1), [Table 7](https://arxiv.org/html/2606.07123#S5.T7.8.14.6.1), [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- A. C. Curry, G. de Francisci Morales, and L. M. Aiello (2026) P1SCO: social dimensions from a perspectivist lens. External Links: 2605.25312, [Link](https://arxiv.org/abs/2605.25312) Cited by: [§1](https://arxiv.org/html/2606.07123#S1.p3.1), [§4](https://arxiv.org/html/2606.07123#S4.p1.1), [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- N. Goyal, I. D. Kivlichan, R. Rosen, and L. Vasserman (2022) Is your toxicity my toxicity? exploring the impact of rater identity on toxicity annotation. Proc. ACM Hum.-Comput. Interact. 6 (CSCW2). External Links: [Link](https://doi.org/10.1145/3555088), [Document](https://dx.doi.org/10.1145/3555088) Cited by: [§8](https://arxiv.org/html/2606.07123#S8.p2.1).
- D. Hovy and D. Yang (2021) The importance of modeling social factors of language: theory and practice. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), Online, pp. 588–602. External Links: [Link](https://aclanthology.org/2021.naacl-main.49/), [Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.49) Cited by: [§1](https://arxiv.org/html/2606.07123#S1.p1.1).
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: [Link](http://arxiv.org/abs/1907.11692), 1907.11692 Cited by: [§3.1](https://arxiv.org/html/2606.07123#S3.SS1.p1.6).
- L. Lucchini, L. M. Aiello, L. Alessandretti, G. De Francisci Morales, M. Starnini, and A. Baronchelli (2022) From reddit to Wall street: the role of committed minorities in financial collective action. Royal Society Open Science 9 (4), pp. 211488. External Links: [Document](https://dx.doi.org/10.1098/rsos.211488) Cited by: [§1](https://arxiv.org/html/2606.07123#S1.p2.1), [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- C. Monti, L. M. Aiello, G. De Francisci Morales, and F. Bonchi (2022) The language of opinion change on social media under the lens of communicative action. Scientific Reports 12 (1), pp. 17920. External Links: [Document](https://dx.doi.org/10.1038/s41598-022-21720-8) Cited by: [§1](https://arxiv.org/html/2606.07123#S1.p2.1), [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- A. Mostafazadeh Davani, M. Díaz, and V. Prabhakaran (2022) Dealing with disagreements: looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics 10, pp. 92–110. External Links: [Link](https://aclanthology.org/2022.tacl-1.6), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00449) Cited by: [§8](https://arxiv.org/html/2606.07123#S8.p2.1).
- M. Orlikowski, P. Röttger, P. Cimiano, and D. Hovy (2023) The ecological fallacy in annotation: modeling human label variation goes beyond sociodemographics. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Toronto, Canada. External Links: [Link](https://aclanthology.org/2023.acl-short.88), [Document](https://dx.doi.org/10.18653/v1/2023.acl-short.88) Cited by: [§7](https://arxiv.org/html/2606.07123#S7.p1.2), [§8](https://arxiv.org/html/2606.07123#S8.p2.1).
- C. Ovesdotter Alm (2011) Subjective natural language problems: motivations, applications, characterizations, and implications. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, D. Lin, Y. Matsumoto, and R. Mihalcea (Eds.), Portland, Oregon, USA, pp. 107–112. External Links: [Link](https://aclanthology.org/P11-2019/) Cited by: [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- J. Pei and D. Jurgens (2023) When do annotator demographics matter? measuring the influence of annotator demographics with the POPQUORN dataset. In Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII), J. Prange and A. Friedrich (Eds.), Toronto, Canada, pp. 252–265. External Links: [Link](https://aclanthology.org/2023.law-1.25/), [Document](https://dx.doi.org/10.18653/v1/2023.law-1.25) Cited by: [§8](https://arxiv.org/html/2606.07123#S8.p2.1).
- B. Plank (2022) The “problem” of human label variation: on ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang (Eds.), Abu Dhabi, United Arab Emirates, pp. 10671–10682. External Links: [Link](https://aclanthology.org/2022.emnlp-main.731/), [Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.731) Cited by: [§1](https://arxiv.org/html/2606.07123#S1.p1.1), [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- M. Sap, S. Swayamdipta, L. Vianna, X. Zhou, Y. Choi, and N. A. Smith (2022) Annotators with attitudes: how annotator beliefs and identities bias toxic language detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, pp. 252–264. External Links: [Link](https://aclanthology.org/2022.naacl-main.431), [Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.431) Cited by: [§8](https://arxiv.org/html/2606.07123#S8.p2.1).
- A. N. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, and M. Poesio (2021) Learning from disagreement: a survey. Journal of Artificial Intelligence Research 72, pp. 1385–1470. External Links: [Document](https://dx.doi.org/10.1613/jair.1.12752), [Link](https://jair.org/index.php/jair/article/view/12752) Cited by: [§8](https://arxiv.org/html/2606.07123#S8.p1.1).
- Z. Waseem (2016) Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, Austin, Texas, pp. 138–142. External Links: [Link](https://aclanthology.org/W16-5618), [Document](https://dx.doi.org/10.18653/v1/W16-5618) Cited by: [§8](https://arxiv.org/html/2606.07123#S8.p2.1).

## Appendix A Prompts

Prompt template (zero-shot)

You are an expert text annotation tool. You output only valid JSON. Never explain, never add commentary.

Annotate the text below for the following social dimensions.

Dimensions:

Rules:

- Only label a dimension if clearly expressed.
- Multiple labels are allowed.
- Never combine `none` with other labels.
- Output only JSON. No explanation. No markdown.

Output format: {{"labels": ["label1", "label2"]}}

Text: {candidate_text}

Prompt template (few-shot)

You are an expert annotator of social dimensions. Label the text with ALL applicable dimensions.

Dimensions:

Rules:

- Only label a dimension if clearly expressed.
- Multiple labels are allowed.
- Never combine `none` with other labels.
- If no dimension is present, return an empty list.
- Output ONLY valid JSON. No explanation.

Output format: {"labels": ["label1", "label2"]}

Examples:

- “I’d recommend using canned beer instead of bottled beer the first few times.” ["knowledge"]
- “You must submit your report by noon or there will be consequences.” ["power"]
- “You did an amazing job on this project!” ["status"]
- “I trust your judgment—go ahead with the plan.” ["trust"]
- “I’m really sorry you’re going through this. I’m here for you.” ["support"]
- “I totally agree with your views on this topic.” ["similarity"]
- “As fellow artists, we understand the struggle.” ["identity"]
- “Haha that’s hilarious, you made my day!” ["fun"]
- “Your argument makes no sense and you’re ignoring the facts.” ["conflict"]
- “I think about you all the time, you make my days brighter.” ["romance"]
- “I totally agree with you, and you explained it really well.” ["similarity", "status"]
- “The sky is blue.” ["none"]

Now classify:

Text: {candidate_text}
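As an illustration of how such a template might be filled and its output consumed, here is a small sketch. It assumes the doubled braces in the zero-shot template are `str.format` escapes and that malformed or off-schema responses are treated as an empty label set; neither is stated in the paper. The template text is abbreviated, and `build_prompt`, `parse_labels`, and `ALLOWED` are hypothetical names, with the label vocabulary taken from the few-shot examples.

```python
import json

# Abbreviated template; the doubled braces survive .format() as literal braces.
ZERO_SHOT_TEMPLATE = (
    "Annotate the text below for the following social dimensions.\n"
    'Output format: {{"labels": ["label1", "label2"]}}\n'
    "Text: {candidate_text}"
)

# Label vocabulary as it appears in the few-shot examples.
ALLOWED = {
    "knowledge", "power", "status", "trust", "support",
    "similarity", "identity", "fun", "conflict", "romance", "none",
}


def build_prompt(candidate_text: str) -> str:
    return ZERO_SHOT_TEMPLATE.format(candidate_text=candidate_text)


def parse_labels(raw: str) -> list:
    """Parse a model response into a list of known dimension labels.

    Assumptions: invalid JSON or an unexpected schema yields []; unknown
    labels are dropped; "none" is never kept, so a response mixing "none"
    with real labels keeps only the real labels.
    """
    try:
        labels = json.loads(raw)["labels"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []
    if not isinstance(labels, list):
        return []
    cleaned = [l.lower() for l in labels if isinstance(l, str) and l.lower() in ALLOWED]
    return [l for l in cleaned if l != "none"]
```

Dropping `none` during parsing is one way to enforce the "never combine" rule after the fact; rejecting such responses outright would be an equally defensible choice.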

## Appendix B Detailed results per dimension

Table 8: Per-label PR-AUC on the test set (Majority task) for all models. Bold: best per column.

Table 9: Per-label PR-AUC on the test set (Perspectivist condition) for all models. Bold: best per column. For LLMs and Majority task, Perspectivist uses candidate-level any-annotator labels ($s > 0$); for Perspectivist task, annotation-level evaluation is used (individual annotator labels).

## Appendix C Detailed results per socio-demographic subgroup

![Refer to caption](https://arxiv.org/html/2606.07123v1/subgroup_heatmap.png)

Figure 1: Per-label PR-AUC delta (additive vs. text-only baseline) broken down by demographic subgroup.
