
# MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media
Source: [https://arxiv.org/html/2605.06940](https://arxiv.org/html/2605.06940)
Souvik Pramanik1,*, S.M. Riaz Rahman Antu1, Shak Mohammad Abyad1, Md. Ibrahim Khalil1, Md. Shahriar Hussain1
{souvik.pramanik, riaz.antu, shak.abyad, ibrahim.khalil03, shahriar.hussain01}@northsouth.edu
1 North South University, Dhaka, Bangladesh
* Correspondence: souvik.pramanik@northsouth.edu

###### Abstract

Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media dataset benchmark, which contains 58K+ social media comments from six sources annotated along four dimensions: category, sentiment, hate speech, and sarcasm. By employing a structured pipeline in which ChatGPT, Gemini, Claude, and Grok individually annotate separate partitions while sharing a common validation set of 20%, we diagnose LLM behavior systematically. We discover a prevalent phenomenon called "instruction-induced label collapse", wherein LLMs show a systematic preference towards fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that LLMs failed to detect 79% and 75% of instances with hateful and sarcastic content compared to a human-calibrated reference. Furthermore, we show that this represents a "label agreement illusion", statistically validated via a nearly null Fleiss' Kappa ($\kappa \approx -0.001$) on sarcasm detection. Across 40+ LLMs, we benchmark how this annotation bias propagates through the training pipeline, regardless of architectural differences. We release MultiSoc-4D as a diagnostic benchmark for annotation biases in Bengali NLP.


Keywords: Bengali NLP · Label Collapse · Closed-Set Annotation · LLM Annotation Bias · Low-Resource NLP · Social Media Dataset · Hate Speech Detection · Inter-Annotator Agreement

## 1 Introduction

![Figure 1](https://arxiv.org/html/2605.06940v1/visabs1.jpg)
Figure 1: Exploring Bengali text annotation using LLMs.

![Figure 2](https://arxiv.org/html/2605.06940v1/intro_figure.jpg)
Figure 2: Overview of MultiSoc-4D dataset and LLM-based annotation pipeline.

The usage of Large Language Models (LLMs) as scalable substitutes for human annotators in building labeled datasets for a diverse set of NLP applications has been increasing. These models' strong zero-shot and few-shot capabilities facilitate the creation of efficient and economical labeling pipelines (brown2020language; ouyang2022training). Consequently, LLM-driven annotation practices have emerged in various applications, including sentiment analysis, hate-speech detection, and topic classification. Despite the increasing use of LLMs for these applications, little is known about how the models behave as annotators. The existing literature concentrates on model performance without considering the role played by annotation guidelines and constrained label spaces in LLM decision-making (wang2023self; gilardi2023chatgpt).

Many LLM-based annotation frameworks assume closed-set labeling: the prediction for every observation must come from a fixed set of predefined labels. Many systems also use uncertainty-averse prompts that make the model choose a fallback label when certainty is low (e.g., Other, Neutral, No). Although this design is meant to keep the system consistent, it has a downside when applied to data collected from social media platforms, which are characterized by diversity, ambiguity, and contextual dependency.

This study is performed on a Bengali multi-platform social media dataset, MultiSoc-4D, annotated by multiple LLMs following a unified closed-set instruction approach. Our observations are described below:

- Label Collapse: LLM annotations show a pronounced bias towards certain classes such as Other, Neutral, and No.
- Asymmetric Agreement: While agreement amongst LLMs is high for these particular labels, agreement on rarer, more nuanced classes (such as sarcasm and hate speech) is almost non-existent.
- Human-LLM Agreement Discrepancies: Compared with human annotations, there is evidence that LLMs fail to grasp implicit information.

The above observations indicate that the level of agreement amongst the LLMs cannot be taken to mean semantic agreement and can rather be attributed to bias from instructions\.

The key contributions of the study include the following:

- Presented MultiSoc-4D, a Bengali social media dataset spanning multiple platforms, labeled along four dimensions: category, sentiment, hatefulness, and sarcasm.
- Conducted an empirical investigation of LLM labeling under closed-set instructions, revealing label collapse and distributional skew.
- Showed that agreement between LLMs is a deceptive measure of semantic consistency, as it mostly reflects the dominance of fallback labels.
- Estimated the gap between LLM and human annotations, demonstrating the challenges LLMs face in comprehending subtle cues.

The rest of the paper is structured as follows. Section [2](https://arxiv.org/html/2605.06940#S2) reviews prior work on LLM-based annotation and social media datasets. Section [3](https://arxiv.org/html/2605.06940#S3) introduces our dataset, MultiSoc-4D, and presents the LLM-based annotation system and its procedure. Section [4](https://arxiv.org/html/2605.06940#S4) performs an empirical analysis of LLM annotation behavior. Section [5](https://arxiv.org/html/2605.06940#S5) conducts human evaluation and quantifies bias. Benchmarking under the biased annotated data is presented in Section [6](https://arxiv.org/html/2605.06940#S6). Section [7](https://arxiv.org/html/2605.06940#S7) discusses the results and overall analysis. Limitations and future directions are presented in Section [8](https://arxiv.org/html/2605.06940#S8). Ethical considerations are covered in Section [9](https://arxiv.org/html/2605.06940#S9), and Section [10](https://arxiv.org/html/2605.06940#S10) concludes the study.

## 2 Related Work

### 2.1 LLMs for Data Annotation

hasan2024zerofewshotpromptingllms present MUBASE, a benchmark for Bengali sentiment analysis with 33,606 social media tweets and posts from Twitter and Facebook labeled as positive, negative, or neutral. The paper compares fine-tuned transformer models (such as BanglaBERT) with zero- and few-shot prompting via GPT-4 and Flan-T5. The findings reveal that BanglaBERT outperforms the prompting approach: an F1-score of 69.39% versus 61.17%. The research underlines the gaps of previously used datasets in terms of annotation consistency and the lack of LLM benchmarking in the field. In turn, tan2024largelanguagemodelsdata present an overview of LLMs as tools for data annotation in independent and human-in-the-loop frameworks. The authors consider three annotation approaches (zero-/few-shot prompting, instruction-based labeling, and iterative improvement) and claim that the first two reach nearly human-level accuracy with reduced cost and time. However, LLMs are sensitive to prompting, prone to biases and hallucinations (especially in subjective and multilingual contexts), and have not yet been sufficiently evaluated in low-resource languages like Bangla.

### 2.2 Social Media Data and Multi-Label Classification Datasets

BANHATE (raquib-etal-2025-banhate) is a dataset for Bangla hate speech classification, with 19,203 YouTube comments labeled on both a binary and a fine-grained hate basis. This work assessed LLMs and transformers and found LLaMA-3.1 (8B) with LoRA to give the best performance (83.83% F1 for Hate); the dataset overcomes the shortcomings of earlier binary-only classification by allowing fine-grained and realistic hate detection. SentiGOLD (sentigold_2023) presented an extensive multi-domain Bangla sentiment dataset with five different labels, achieving a macro F1 score of up to 0.62 with deep learning and classical ML models and addressing the labeling noise and non-standard methods of previous data. BanglaBook (kabir-etal-2023-banglabook) presented a very large dataset of 158K Bangla book reviews; with Bangla-BERT achieving an F1 score of 93.31%, it outperforms traditional models and, overcoming the limits of previous small datasets, enables effective product sentiment classification. paul2025analyzingemotionsbanglasocial proposed EmoNoBa, a dataset of 22,698 comments annotated with six emotions; classical models performed better than BiLSTM (F1 = 38.69%), with AdaBoost being the best, and LIME was used for explanations, addressing the interpretability gap of previous emotion classification datasets. 10129187 introduced BE-CM, a Bangla-English code-mixed sentiment classification dataset of 18,074 reviews; XGBoost with FastText and augmentation obtained an F1 score of 87%, making it more robust to noisy text without requiring parallel corpora. das-bandyopadhyay-2010-labeling created a sentence-level Bengali emotion recognition dataset covering 12K sentences labeled with Ekman's six emotions; the most successful method was SVM (accuracy = 80.55%), allowing nuanced emotion-level analysis. haider-etal-2025-banth created BanTH, a dataset of 37,350 transliterated Bangla comments across seven hate speech classes; the best model, TB-mBERT, reached 77.36% Macro-F1, addressing transliteration and multi-label challenges in real-world data. Potrika (ahmad2022potrikarawbalancednewspaper) was proposed as a 320K Bangla news dataset; GRU+FastText reaches 92% accuracy, but performance declined substantially under weak supervision, emphasizing the need for manual labeling. BanglishRev (shamael2024banglishrevlargescalebanglaenglishcodemixed) consists of 1.74M e-commerce reviews with Bangla, English, and Banglish text; BanglishBERT demonstrated F1 scores around 94%, enabling large-scale sentiment and behavior analysis. islam-etal-2022-emonoba showed that classical models outperform deep models on EmoNoBa (22K comments) because of the informal nature of Bangla text relative to pretrained language models. BnSentMix (alam-etal-2025-bnsentmix) is a 20K code-mixed Bangla-English text dataset on which transformer-based models achieved 69.8% accuracy, improving the handling of real-world mixed-language content. article_hossain created a small-scale emotion-based Bangla-English dataset (2,055 comments); SVM yielded 85.7% accuracy, suggesting that emojis play a critical role in sentiment classification. HASAN2024111107 presented a new Bangla ASRB dataset; however, due to its small sample size and narrow scope, it cannot be used for benchmarking, and although it contributes to fine-grained sentiment analysis, there is room for improvement. ISLAM2024100069 proposed BangDSA, a dataset of 203K comments with 15 emotion types; CNN-BiLSTM achieved accuracies of 90.24% (15 classes) and 95.71% (three classes), but the collection is unbalanced and the dataset is not publicly available. Hossain et al. (hossain_fahima2025) developed a small Bangla sentiment dataset of only 3K samples evaluated with three algorithms (SVM, CNN, and LSTM), where the LSTM model performed best (informal accuracy = 80.3%).

### 2.3 Annotation Bias and Label Noise

NC-SentNoB (elahi-etal-2024-comparative) is a benchmark consisting of 15,000 Bangla samples labeled for 10 types of noise. elahi-etal-2024-comparative conducted experiments using several models, such as SVM, BiLSTM, Bangla-BERT, and MuRIL; Bangla-BERT-Base was found to outperform the others in noise detection (F1: 0.62). The best sentiment performance was achieved by BanglaBERT (F1: 0.75), but it dropped slightly after noise reduction (F1: 0.73). In other words, current techniques for addressing label noise may not work well enough and can alter semantic meaning. choi2024multinewscostefficientdatasetcleansing presented Multi-news+, an LLM-based framework for cleaning datasets to save on labeling costs. Compared with heuristic methods, Multi-news+ led to higher annotation consistency and better performance of machine learning models trained on the cleaned datasets. However, the framework's efficacy depends on the prompt design and the language models used, and it is less effective in low-resource and domain-specific settings.

### 2.4 Closed-Set vs Open-Set Annotation

A cost-aware LLM-based approach for online dataset labeling proposed in elumar2025costawarellmbasedonlinedataset balances LLM-based and less costly methods, achieving improved annotation efficiency without compromising performance. However, LLM-based bias is still present in the work, ambiguity and domain specificity may negatively affect reliability, and evaluations on controlled datasets are insufficient. bogdanov2024nunerentityrecognitionencoder propose NuNER, a framework based on synthetic LLM-produced labels for pretraining encoders for named entity recognition. Its token-level performance exceeds zero-shot and supervised models, yet the framework requires high-quality annotations, is sensitive to labeling bias, and is evaluated only on English datasets. electronics14142800 introduce a unified LLM-based annotation platform that uses Llama 3.3 with different prompting techniques, including zero-shot, few-shot, chain-of-thought, and role-based prompts, to generate synthetic datasets and labels. Although performance is impressive (99% on synthetic data and up to 92% on AG News), scalability and context-window problems persist, along with low effectiveness for similar classes.

The summary of related work on Bangla NLP datasets and LLM-based annotation is shown in Table [1](https://arxiv.org/html/2605.06940#S2.T1).

Table 1: Summary of Related Work on Bangla NLP Datasets and LLM-based Annotation

## 3 MultiSoc-4D Dataset

### 3.1 Data Collection

#### 3.1.1 Data Sources

The MultiSoc\-4D dataset was built using posts from different social media websites such as Facebook, Twitter \(X\), YouTube, TikTok, Likee, and Instagram\. The selection of social media platforms was done to maintain heterogeneity in terms of writing style, user base, and discourse\. In contrast to specialized domain corpora, social media posts contain informal speech, slang, code switching, and context\-based meanings\. The posts of the dataset are predominantly written in Bengali, although there might be instances of English words, hashtags, and user tags\.

![Figure 3](https://arxiv.org/html/2605.06940v1/data_pipeline.jpg)
Figure 3: Data collection and preprocessing pipeline for MultiSoc-4D.

#### 3.1.2 Sampling Strategy

A random sampling method was employed to reflect the natural distribution without imposing balance among domains. The aim is to preserve the natural attributes of social media data, including imbalanced classes, diverse topics, and varied language. To prevent bias from oversampling any single source, proportional sampling was conducted across all sources; a sketch of this step follows. No manual category-based content filtering was performed.
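As an illustration only, proportional per-source sampling of this kind can be expressed in a few lines with pandas; the `source` column name and the sampling fraction are assumptions about the data layout, not details given in the paper.

```python
import pandas as pd

def proportional_sample(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Draw the same fraction from every platform so that no single
    source is over-sampled, preserving the natural distribution."""
    return (df.groupby("source", group_keys=False)
              .sample(frac=frac, random_state=seed))
```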

### 3.2 Dataset Schema

The dataset is annotated across four dimensions: category, sentiment, hateful, and sarcasm. Figure [4](https://arxiv.org/html/2605.06940#S3.F4) provides an overview of the dataset schema.

![Figure 4](https://arxiv.org/html/2605.06940v1/schema.jpg)
Figure 4: Dataset schema of MultiSoc-4D across four dimensions.

#### 3.2.1 Categories (8 Classes)

Each sample is labeled with one of eight mutually exclusive categories: International, National, Sports, Education, Entertainment, Economy, Technology, and Others. The Others category serves as a catch-all for samples that do not fit the defined topics.

#### 3.2.2 Sentiment (3 Classes)

Sentiment is labeled with one of the following: Positive, Negative, or Neutral. The Neutral label is assigned to objectively neutral content as well as sentiment-neutral samples.

#### 3.2.3 Hate Speech Labeling

Hate speech is labeled as a binary task: Yes or No. Yes is applied to samples containing direct or indirect hate speech directed at individuals or groups.

#### 3.2.4 Sarcasm Labeling

As with hate speech, sarcasm is labeled Yes or No. Yes characterizes implicit expressions whose intended meaning diverges from the literal one.

### 3.3 Annotation Framework

#### 3.3.1 Annotators

The annotation process uses four large language models: ChatGPT (gilardi2023chatgpt), Gemini (team2023gemini), Claude (anthropic2024claude), and Grok (xai2024grok). These models are chosen for their high proficiency in instruction-following and text classification tasks.

#### 3.3.2 Instruction Design

The annotators are provided with identical annotation instructions. In line with the closed-set labeling procedure, each dimension is assigned one label from a predefined set.

Closed-Set Rules: To handle uncertainty during annotation, the instructions include the following default rules:

- Assign Other for unclear categories
- Assign Neutral for unclear sentiments
- Assign No for unknown cases of hatefulness and sarcasm

Though this set of instructions is designed to promote consistency in the annotation process, it inadvertently introduces a bias towards conservative labeling. The extended instructions are shown in Appendix [A](https://arxiv.org/html/2605.06940#A1); a minimal sketch of such a closed-set prompt follows.
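As an illustration, the closed-set instruction with its fallback defaults could be rendered as a prompt template like the sketch below. The authors' actual prompt is reproduced in Appendix A; the exact wording, label ordering, and names here are assumptions.

```python
# Illustrative closed-set annotation prompt; wording is an assumption,
# while the label sets and fallback rules follow Section 3.2 and Appendix A.
CLOSED_SET_PROMPT = """Act as a data annotator for Bengali social media comments.
Assign exactly one label per dimension:
- category: International, National, Sports, Education, Entertainment,
  Economy, Technology, Other
- sentiment: Positive, Negative, Neutral
- hateful: Yes, No
- sarcasm: Yes, No
Fallback rules for ambiguous inputs:
- unclear category  -> Other
- unclear sentiment -> Neutral
- unknown hatefulness or sarcasm -> No
Comment: {comment}
Answer with: category, sentiment, hateful, sarcasm
"""

def build_prompt(comment: str) -> str:
    """Fill the template for a single comment."""
    return CLOSED_SET_PROMPT.format(comment=comment)
```

Note how the fallback rules make the conservative answer the path of least resistance for an uncertain model, which is exactly the failure mode diagnosed in Section 4.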

### 3.4 Dataset Statistics

#### 3.4.1 Platform Distribution

The dataset includes contributions from multiple social media platforms, with varying proportions. This diversity ensures coverage of different discourse styles and interaction patterns.

![Figure 5](https://arxiv.org/html/2605.06940v1/source.jpg)
Figure 5: Class distribution across all the platforms.

Additional statistics are shown in Appendix [C](https://arxiv.org/html/2605.06940#A3).

### 3.5 Data Preprocessing

Before the annotation phase, only a few preprocessing steps are performed on the collected dataset so that its inherent nature is not lost:

- Elimination of duplicates
- Simple normalization of text encoding
- Removal of non-text elements such as HTML tags

Notably, the natural linguistic phenomena of the text, including slang, colloquial language, and minor code mixing, have been preserved to keep the essence of the data intact.

## 4 Empirical Analysis of Annotation Behavior

The complete dataset of N samples is split into four mutually exclusive subsets of equal size, and each subset is annotated by one of the LLMs. To enable inter-annotator agreement studies, a 20% representative sample of the complete dataset, drawn from all four subsets, is annotated by all four models independently. This results in a common validation set on which all four LLMs annotate the same samples. Figure [6](https://arxiv.org/html/2605.06940#S4.F6) illustrates the annotation pipeline; a minimal sketch of the split follows.
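A minimal sketch of this partitioning, assuming a flat list of samples and uniform random assignment (the paper does not specify the exact sampling mechanics):

```python
import random

def split_for_annotation(samples, n_models=4, shared_frac=0.20, seed=42):
    """Split the corpus into one partition per LLM annotator, then draw a
    shared validation sample from every partition so that all models
    label the same 20% subset."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // n_models
    partitions = [shuffled[i * size:(i + 1) * size] for i in range(n_models)]
    shared = []  # annotated by all four models
    for part in partitions:
        shared.extend(rng.sample(part, int(len(part) * shared_frac)))
    return partitions, shared
```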

![Figure 6](https://arxiv.org/html/2605.06940v1/annotation_pipeline.jpg)
Figure 6: LLM-based annotation pipeline with dataset splitting and multi-model labeling.

### 4.1 Label Distribution Collapse

We start with the analysis of the label distributions produced by all LLM annotators across the four annotation dimensions: category, sentiment, hateful content, and sarcasm. According to Figure [7](https://arxiv.org/html/2605.06940#S4.F7), all models show a strong and stable skew towards fallback labels. Specifically, the Other label accounts for roughly 51.8% of category labels, and Neutral is predominant in sentiment annotations (65.3%). Binary dimensions show an even stronger bias, with No chosen for 95.4% of hateful-content samples and 96.4% of sarcasm samples. This amounts to a serious compression of the label space, during which ambiguous inputs get mapped onto conservative fallback classes. Moreover, the tendency holds across all models considered, suggesting that the collapse is not model-specific but depends on instruction formulation; a sketch for computing these fallback shares is given below.
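The fallback shares quoted above are simple label proportions; a minimal sketch of the computation, with dimension and fallback names taken from the schema in Section 3.2 (the function name is an assumption):

```python
from collections import Counter

# Fallback label per annotation dimension (Section 3.3.2).
FALLBACK = {"category": "Other", "sentiment": "Neutral",
            "hateful": "No", "sarcasm": "No"}

def fallback_share(labels_by_dim: dict) -> dict:
    """labels_by_dim maps a dimension name to the list of assigned labels;
    returns the fraction of samples that received the fallback label,
    the quantity behind the 51.8% / 65.3% / 95.4% / 96.4% figures."""
    shares = {}
    for dim, labels in labels_by_dim.items():
        counts = Counter(labels)
        shares[dim] = counts[FALLBACK[dim]] / max(len(labels), 1)
    return shares
```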

![Figure 7](https://arxiv.org/html/2605.06940v1/class_distribution.jpg)
Figure 7: Label distribution across annotation dimensions showing strong fallback dominance.
### 4.2 Cross-Model Consistency

The following analysis focuses on the consistency between the large language models when presented with the same labeling task under identical conditions. Despite variability in architecture and training data, ChatGPT (gilardi2023chatgpt), Gemini (team2023gemini), Claude (anthropic2024claude), and Grok (xai2024grok) all produce extremely similar label distributions across all dimensions. The underlying structure for all four LLMs is characterized by a very high dependence on fallback labels and avoidance of minority semantic labels.

![Figure 8](https://arxiv.org/html/2605.06940v1/model-con.jpg)
Figure 8: All models' reliance on fallback labels.

This consistency, shown in Figure [8](https://arxiv.org/html/2605.06940#S4.F8), indicates that closed-set annotation induces a shared behavioral regime across heterogeneous LLMs, effectively overriding model-level variability.

### 4.3 Inter-Annotator Agreement Analysis

We quantify inter-model agreement using Fleiss' Kappa over a randomly sampled 20% subset of the dataset, where all four models annotate the same instances; a sketch of this computation is shown below.
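A sketch of this computation using the Fleiss' Kappa implementation in statsmodels; the function name and data layout are illustrative assumptions:

```python
import numpy as np
from statsmodels.stats import inter_rater as irr

def fleiss_kappa_for_dimension(annotations) -> float:
    """annotations: shape (n_samples, n_raters), one label per LLM for
    each sample in the shared 20% subset. aggregate_raters converts the
    raw assignments into per-sample category counts, which fleiss_kappa
    expects."""
    table, _ = irr.aggregate_raters(np.asarray(annotations))
    return irr.fleiss_kappa(table, method="fleiss")

# Example: four models labeling the same samples for sarcasm.
print(fleiss_kappa_for_dimension([
    ["No", "No", "No", "Yes"],
    ["No", "No", "No", "No"],
    ["No", "Yes", "No", "No"],
]))
```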

#### 4.3.1 Overall Fleiss' Kappa

The level of agreement between annotators varies greatly by annotation dimension, as illustrated in Table [2](https://arxiv.org/html/2605.06940#S4.T2). There is moderate agreement for category ($\kappa \approx 0.41$) and sentiment ($\kappa \approx 0.55$), but agreement is rather poor for hateful content ($\kappa < 0.39$) and almost absent for sarcasm ($\kappa \approx -0.001$).

Table 2: Fleiss' Kappa Scores Across Annotation Dimensions

This discrepancy suggests that LLMs exhibit apparent consistency only in coarse-grained or structurally simple tasks, while failing to maintain alignment in semantically complex or context-dependent dimensions.

#### 4.3.2 Class-wise Agreement

Table 3: Class-wise Agreement Metrics Across Four Annotators. (FAR = Full Agreement Ratio, AAS = Average Agreement Strength.)

The findings in Table [3](https://arxiv.org/html/2605.06940#S4.T3) reveal a clear bias towards negative or neutral default predictions. On the sarcasm and hatefulness dimensions, the models almost always agree that the feature is absent (choosing "No"); however, the findings suggest that while the annotators (LLMs) effectively filter "standard" content, the detection of specific topics and sentiments remains highly subjective and prone to disagreement.

### 4.4 Agreement vs Label Frequency

We further analyze the relationship between label frequency and agreement. A strong positive correlation is observed between class prevalence and inter-annotator agreement.

![Figure 9](https://arxiv.org/html/2605.06940v1/kappa.jpg)
Figure 9: Inter-annotator agreement (Fleiss' Kappa) across label frequency.

From Figure [9](https://arxiv.org/html/2605.06940#S4.F9) it is clear that high-frequency fallback labels consistently exhibit higher agreement, while low-frequency semantic classes show unstable and near-random agreement behavior. This suggests that agreement metrics in closed-set LLM annotation are heavily confounded by label distribution imbalance rather than reflecting true inter-model understanding.

## 5 Human Evaluation and Bias Quantification

This section compares LLM-generated annotations with human annotations on a stratified random subset of 500 samples (referred to as the GOLD set). The goal is to quantify distributional bias and identify systematic deviations in LLM labeling behavior.

### 5.1 Human Annotation Setup

A subset of 500 samples is independently annotated by human annotators following the same four-dimensional schema used for LLM annotation. Each instance is labeled for category, sentiment, hateful content, and sarcasm according to the original annotation guidelines. The annotation instructions were the same for the humans and the LLMs; they are given in Appendix [A](https://arxiv.org/html/2605.06940#A1).

This subset serves as a calibration set to evaluate distributional and error-level divergence between human judgments and LLM-generated annotations, validating the problem raised in this study.

### 5.2 Distribution Comparison

We compare the label distributions produced by LLMs \(averaged across models\) and human annotations on the 500\-sample subset\.

Table 4: Comparative Label Distribution: Individual LLMs vs. Human Gold Standard (N=500). Cl = ClaudeAI, GPT = ChatGPT, Ge = Gemini, Gr = Grok, H = Human.

| Dim. | Class | Cl. | GPT | Ge. | Gr. | H. |
| --- | --- | --- | --- | --- | --- | --- |
| Category | Other | 169 | 427 | 390 | 441 | 291 |
| | National | 191 | 47 | 51 | 27 | 96 |
| | Entertainment | 54 | 7 | 12 | 11 | 29 |
| | International | 23 | 8 | 7 | 5 | 24 |
| | Technology | 25 | 2 | 4 | 2 | 20 |
| | Economy | 19 | 4 | 11 | 6 | 18 |
| | Sports | 13 | 3 | 10 | 3 | 12 |
| | Education | 6 | 2 | 15 | 5 | 10 |
| Sentiment | Neutral | 173 | 446 | 460 | 470 | 316 |
| | Negative | 231 | 23 | 29 | 19 | 90 |
| | Positive | 96 | 31 | 11 | 11 | 94 |
| Hateful | No | 474 | 491 | 489 | 490 | 454 |
| | Yes | 26 | 9 | 11 | 10 | 46 |
| Sarcasm | No | 467 | 488 | 498 | 493 | 465 |
| | Yes | 33 | 12 | 2 | 7 | 35 |

The results in Table [4](https://arxiv.org/html/2605.06940#S5.T4) indicate that LLMs currently act as "risk-averse" classifiers. Their tendency to favor Neutral, Non-Hateful, and Non-Sarcastic labels leads to a sanitized version of the data that lacks the granularity and sensitivity of human judgment. This poses a significant challenge for using LLMs in automated content moderation and nuanced social media analysis.

### 5.3 Bias Ratio Analysis

To measure how far the model outputs deviate from the ground-truth human labels, we introduce the Bias Ratio metric, defined by

$$\text{Bias Ratio} = \frac{\text{LLM Label Frequency (Avg.)}}{\text{Human Label Frequency}} \qquad (1)$$

As illustrated in Table [5](https://arxiv.org/html/2605.06940#S5.T5), the Bias Ratio analysis demonstrates a pattern of inherent structural bias within the LLM labels: fallback classes are significantly overrepresented (Bias Ratio > 1) across all dimensions, while semantic and minority classes such as Sarcasm, Hateful Content, and Sports are persistently underrepresented (Bias Ratio < 1).

Table 5: Bias Ratio Across All Annotation Dimensions (N = 500). H = Human, BR = Bias Ratio.

| Dim. | Class | LLM Avg. | H. | BR |
| --- | --- | --- | --- | --- |
| Category | Other | 356.75 | 291 | 1.22 |
| | National | 79.00 | 96 | 0.82 |
| | Entertainment | 21.00 | 29 | 0.72 |
| | International | 10.75 | 24 | 0.45 |
| | Technology | 8.25 | 20 | 0.41 |
| | Economy | 10.00 | 18 | 0.56 |
| | Sports | 7.25 | 12 | 0.60 |
| | Education | 7.00 | 10 | 0.70 |
| Sentiment | Neutral | 387.25 | 316 | 1.23 |
| | Negative | 75.50 | 90 | 0.84 |
| | Positive | 37.25 | 94 | 0.40 |
| Hateful | Yes | 14 | 46 | 0.30 |
| | No | 486 | 454 | 1.07 |
| Sarcasm | Yes | 13.5 | 35 | 0.38 |
| | No | 486.5 | 465 | 1.04 |

Bias Ratios provide an estimate of how much "filtering" LLMs do. For sarcasm (Yes) and hateful (Yes), the bias ratios of 0.38 and 0.30 show that LLMs pick up only a fraction of what humans detect. On the other hand, Other and Neutral have bias ratios of 1.22 and 1.23, respectively, showing that labels are pushed away from the more informative classes. A small helper for computing these ratios is sketched below.
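A minimal sketch for Equation 1, shown with two entries from Table 5 (the function name is an assumption):

```python
def bias_ratio(llm_avg_count: float, human_count: float) -> float:
    """Bias Ratio (Eq. 1): average LLM label frequency divided by the
    human label frequency; > 1 means over-use, < 1 under-use."""
    return llm_avg_count / human_count

# Entries from Table 5 (N = 500):
print(bias_ratio(356.75, 291))  # Other: > 1, over-represented
print(bias_ratio(13.5, 35))     # Sarcasm = Yes: < 1, under-represented
```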

### 5.4 Model-to-Human Distribution

To calculate the Model-to-Human (M2H) Distribution, we apply the formula:

$$\text{M2H Distribution} = \frac{\text{LLM Label Count}}{\text{Human Label Count}} \qquad (2)$$
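Equation 2 is a per-model count ratio; a minimal sketch using the sarcasm counts from Table 4 (the function name is an assumption):

```python
def m2h_ratio(llm_count: int, human_count: int) -> float:
    """Model-to-Human distributional ratio (Eq. 2); values near 1.00 mean
    the model assigns a label about as often as the human annotators."""
    return llm_count / human_count

# Sarcasm = Yes counts from Table 4 (human total: 35):
print(round(m2h_ratio(33, 35), 2))  # Claude: 0.94, close to human
print(round(m2h_ratio(2, 35), 2))   # Gemini: 0.06, nearly blind to sarcasm
```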
Table 6: Model-to-Human (M2H) Distributional Ratios (values closer to 1.00 are better). Cl = ClaudeAI, GPT = ChatGPT, Ge = Gemini, Gr = Grok, H = Human.

Based on the distributional ratios and the raw counts provided in Table [6](https://arxiv.org/html/2605.06940#S5.T6), Claude emerges as the most effective annotator among the four models, despite its own specific biases. Claude (anthropic2024claude) is the only model that consistently identifies long-tail semantic categories at frequencies comparable to humans. For example, its ratios for International (0.96), Economy (1.06), Sports (1.08), and Sarcasm (0.94) are remarkably close to the human baseline of 1.00. While it tends to over-detect National and Negative content, it avoids the catastrophic label-space compression seen in the other models.

By the ratios in Table [6](https://arxiv.org/html/2605.06940#S5.T6), GPT, Gemini, and Grok are the most conservative, fallback-biased models, showing severe over-representation of fallback labels and extreme under-representation of informative classes. Specifically:

- Fallback Labels: These models exhibit Other ratios ranging between 1.34 and 1.52.
- GPT: Captures only 10% of Technology and 20% of Hateful content identified by humans.
- Gemini: Is nearly "blind" to Sarcasm, identifying only 2 instances where humans found 35 (ratio: 0.06).
- Grok: Exhibits the highest reliance on the Other category (441 instances vs. 291 human) and the Neutral sentiment (ratio: 1.49).

### 5.5 False Negative Analysis

We further analyze false negatives, defined as instances where LLMs assign fallback labels (e.g., No, Neutral, Other) while human annotators identify them as semantically meaningful classes.

Table 7: False Negative (FN) Impact on Nuanced Classes

A large proportion of missed cases is observed in sarcasm and hateful-content detection, where the FN rate exceeds 75% (Table [7](https://arxiv.org/html/2605.06940#S5.T7)). This suggests that LLMs exhibit a "conservative bias," systematically failing to detect implicit, context-dependent, or pragmatically encoded expressions that are readily apparent to human readers. The false negative rates of 79% and 75% for hateful content and sarcasm annotations, respectively, represent one of the most important results of this experiment. In effect, LLM-based annotation appears almost blind to the most intricate characteristics of human communication. On top of this, the FN rate in specialized cases (57.4%) reinforces the hypothesis that LLMs favor a generalist interpretation over specialized information. The implication for researchers is clear: using LLMs to annotate data yields "cleansed" data. A sketch of the FN-rate computation follows.
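A sketch of the FN-rate computation as defined above: among samples where humans assign a meaningful (non-fallback) label, the fraction the LLM maps to a fallback label instead. The fallback set follows Section 3.3.2; the function name is an assumption.

```python
FALLBACKS = {"Other", "Neutral", "No"}

def false_negative_rate(llm_labels, human_labels) -> float:
    """Fraction of human-identified semantic instances that the LLM
    collapsed into a fallback label."""
    positives = [(llm, hum) for llm, hum in zip(llm_labels, human_labels)
                 if hum not in FALLBACKS]
    if not positives:
        return 0.0
    missed = sum(1 for llm, _ in positives if llm in FALLBACKS)
    return missed / len(positives)
```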

### 5.6 Implications for Annotation Reliability

Analysis suggests that LLM-powered consistency is an "agreement illusion," where high reliability metrics (e.g., Fleiss' Kappa) may emerge due to the dominance of the fallback label instead of semantic consistency with human perception.

Thus, a key issue in data engineering becomes semantic space compression. Closed-set LLM-based annotations are biased towards the underrepresentation of less frequent categories or subtle occurrences like irony or implicit toxicity. As a result, the usage of these "silver-standard" datasets can lead to models incapable of recognizing rare but essential language signals.

## 6 Benchmarking Under Biased Annotations

This section evaluates the impact of LLM\-induced annotation bias on downstream benchmarking tasks\. We analyze how training and evaluation under LLM\-generated labels affects model behavior\.

### 6.1 Experimental Setup

We evaluate the impact of LLM\-induced annotation bias using the following configuration:

- Data and Supervision: Models are trained on the full 58K LLM-annotated MultiSoc-4D dataset.
- Task Setting: We perform multi-label classification across the four analyzed dimensions.
- Model Selection: Our benchmark includes a wide array of models:
  - LLMs: Qwen2.5 (0.5B-7B) (qwen2024qwen25), LLaMA3.2 (1B-3B), LLaMA3.1 (8B), TinyLLaMA 1.1B, Aya Expanse 8B, Phi-3/4, Gemma variants, and Mistral (jiang2023mistral) variants.
  - Transformers: XLM-RoBERTa, MuRIL, RemBERT, BanglaBERT, DeBERTa, XLNet, Electra, FinBERT, BanglaT5, and IndicBERT (wolf2020transformers).
  - Traditional ML: Linear/Logistic Regression, Ridge/Lasso Classifier, SVM, KNN, Random Forest, Decision Tree, Gradient Boost, AdaBoost, Multinomial/Gaussian NB, LightGBM, XGBoost, and CatBoost (bishop2006pattern).
- Hyperparameters: Training is conducted using PyTorch and Hugging Face with LoRA adaptation. We use a learning rate of 2e-5, a batch size of 32, and 1 training epoch.
- Hardware: All experiments are performed on an NVIDIA RTX 4070 Ti Super GPU.
- Evaluation Metrics: We evaluate performance using standard text classification metrics, including Accuracy, Precision, Recall, and the F1-score. While these metrics provide a broad overview of model efficacy, our benchmark study primarily focuses on the Macro-F1 score. This choice allows us to rigorously analyze class-wise performance and mitigate the impact of class imbalance (bias). By averaging the F1-scores of each class independently, the Macro-F1 metric ensures that the model's ability to identify minority classes is treated with equal importance to the majority classes. The Macro-F1 score is defined as the arithmetic mean of the individual F1-scores for all classes, as shown in Equation 3 (a code sketch follows this list): $$\text{Macro-F1} = \frac{1}{N}\sum_{i=1}^{N} F1_i \qquad (3)$$ where $N$ represents the total number of classes and $F1_i$ is the F1-score for the $i$-th class, calculated as: $$F1_i = 2 \times \frac{\text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} \qquad (4)$$
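Equations 3 and 4 correspond to the standard macro-averaged F1. The sketch below uses scikit-learn and invented labels to illustrate the accuracy vs. Macro-F1 gap that recurs throughout the benchmark:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy 90/10 imbalanced binary task where the model almost always predicts
# the majority "No" label (data invented purely for illustration).
y_true = ["No"] * 90 + ["Yes"] * 10
y_pred = ["No"] * 98 + ["Yes"] * 2   # catches only 2 of the 10 "Yes" cases

print(accuracy_score(y_true, y_pred))             # 0.92: looks strong
print(f1_score(y_true, y_pred, average="macro"))  # ~0.65: exposes the misses
```

Because the macro average weights both classes equally, the collapsed minority class drags the score down even while accuracy stays high.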

### 6.2 Model Performance

Table 8: Performance Comparison Across Model Architectures

From Table [8](https://arxiv.org/html/2605.06940#S6.T8), it is evident that monolingual and multilingual transformers (BanglaBERT and IndicBERT) outperform instruction-tuned LLMs in detecting subtle aspects of language. The LLMs exhibit very high accuracy but perform poorly in terms of Macro-F1, especially on sarcasm; these models likely cannot detect the minority classes properly due to their complexity. The best-performing model for critical aspects such as hatefulness is BanglaBERT, with a Macro-F1 score of 0.77.

### 6.3 Human-Aligned Evaluation

To assess the impact of label-space compression on downstream model performance, we evaluate our top-performing models on the human-annotated gold subset. Given that Claude exhibited the highest distributional alignment with human labels in our previous analysis, we compare models trained on Claude-annotated data against the same models evaluated on human gold-standard labels. The results, detailed in Table [9](https://arxiv.org/html/2605.06940#S6.T9), show that models evaluated against human annotations generally exhibit degraded performance, particularly in accuracy, compared to their evaluation on LLM-generated labels. This confirms that training on collapsed label distributions reduces model sensitivity to minority semantic phenomena and reinforces conservative prediction behaviors that diverge from actual human judgment.

Table 9: Model Performance: Claude-Annotated vs. Human Gold Standard (N=500)

The evaluation reveals a critical gap between LLM-derived and human-derived performance metrics. In the category and sentiment dimensions, accuracy is significantly higher when evaluated against humans (reaching 0.54 and 0.63, respectively), suggesting that the models' "reasoning" aligns more closely with human labels than with the sanitized Claude annotations. However, in the high-stakes dimensions of hateful content and sarcasm, performance remains nearly identical across both datasets, with high accuracy (≈ 0.92) and low Macro-F1 (≈ 0.44). This indicates that even the best-performing models (Mistral Nemo and Mistral3 7B) struggle with the "Accuracy-F1 Paradox": they successfully predict the majority label ("No") but fail to reliably capture the nuance identified by humans, proving that even a "human-aligned" LLM like Claude still provides a training signal insufficient for mastering complex, context-dependent semantic phenomena.

### 6.4 Bias Sensitivity Across Models

Finally, we analyze how diverse model architectures respond to systematic annotation bias. As visualized in Figure [10](https://arxiv.org/html/2605.06940#S6.F10), all models, ranging from small-scale transformers to large instruction-tuned LLMs, exhibit nearly identical sensitivity patterns. While models achieve high accuracy on fallback-dominated tasks, there is a universal failure to generalize to human-identified semantic classes.

![Figure 10](https://arxiv.org/html/2605.06940v1/macsen.jpg)
Figure 10: Macro-F1 sensitivity across models.

This uniformity indicates that annotation bias is an architecture-agnostic phenomenon; it propagates through the training pipeline and systematically restricts downstream model behavior, regardless of the specific model choice.

## 7 Discussion

### 7.1 Label Ambiguity Analysis

The difference between human and LLM tagging can be attributed to the fact that social media language is inherently ambiguous. While humans use contextual knowledge to decipher the meaning of ambiguous phrases, LLMs find it difficult to comprehend pragmatically coded expressions. Our analysis reveals that an LLM's response to ambiguous inputs is a "majority-class consensus."

### 7.2 Other-Class Overloading

One interesting result from the MultiSoc-4D dataset is the "overloading" of the "Other" category. Although humans carefully classified samples into distinct categories, including Technology, International, and Education, LLMs such as GPT and Grok assigned far too many samples to "Other". This label-space compression essentially filters out the long-tail variety within the dataset.

### 7.3 Neutral vs Non-Sentiment Cases

We notice an important conflation between "Neutral" sentiment and non-sentiment instances. LLMs display a "conservative bias" and are more likely to classify data as Neutral even when sentiment triggers are embedded in it. The accuracy-F1 paradox on the sentiment axis is supported by the observation that its high accuracy stems from the prevalence of the Neutral label.

### 7.4 Error Analysis

From our error analysis, we conclude that false negatives for sarcasm and hate speech are the main failure modes. The relatively high false negative rate (over 75% in some models) implies that large language models tend to underestimate implicit toxicity, which results in "sterilizing" the most socially and linguistically complex samples.

## 8 Limitations and Future Work

### 8.1 Closed Label Schema Limitation

The present study is confined to a closed-set labeling methodology, in which models are forced to select from a fixed set of labels. This limitation may itself contribute to the dominance of fallback labels, because the models cannot indicate uncertainty or propose alternative semantic labels.

### 8.2 Instruction-Induced Bias

The behavior of the instruction-tuned models employed for annotation (such as LLaMA, Qwen, and Phi) is strongly affected by their safety alignment. Such "safety-first" training appears to make the models unwilling to classify content as Hateful or Sarcastic, causing the underrepresentation of these two minority classes in the silver-standard dataset.

### 8.3 Future Direction: Open-Set Labeling (NaN)

To tackle label-space reduction, future research will consider open-set labeling with "NaN" or "Uncertain" labels. The objective is to enable models to provide natural language explanations or to indicate when no label in the schema fits.

## 9 Ethical Considerations

### 9.1 Data Privacy

The MultiSoc\-4D dataset is derived from publicly available social media text\. To protect user privacy, all personally identifiable information \(PII\) has been removed, and the research adheres to standardized academic protocols for large\-scale data analysis\.

### 9.2 Content Sensitivity

This research involves the analysis of Hateful Content and Sarcasm\. While essential for developing robust safety filters, exposure to such content during the human annotation process was managed with care to minimize psychological impact on the annotators\.

### 9.3 Bias and Fairness

We acknowledge that the LLMs used for benchmarking—such as Claude, Gemini, and GPT—contain inherent biases from their pre\-training data\. By comparing these against human gold standards, this study explicitly aims to highlight and quantify these biases to foster more fair and transparent automated annotation pipelines\.

## 10 Conclusion

MultiSoc-4D presents a Bengali dataset spanning multiple platforms, making it evident that label collapse is a systemic failure mode of closed-set LLM annotation. Our analysis of four state-of-the-art LLMs on a common validation set shows that this is due to the instruction paradigm and not to any model-specific shortcoming. We measure the phenomenon by benchmarking against a human reference set of 500 instances: LLMs catch less than one-quarter of the offensive and sarcastic posts that humans mark as such, with false negative rates of 79% and 75%, respectively. The high agreement across models seen throughout the dimensions is a misleading statistic, driven by convergence on fallback labels rather than any actual mutual semantic comprehension. This illusion becomes statistically concrete in the Fleiss' $\kappa$ score of $\approx -0.001$ for sarcasm alongside nearly 96% agreement on "No". Cross-benchmarking down the pipeline with more than 40 models, including traditional ML, transformer-based, and instruction-tuned models, reveals that the bias is not remedied during training; instead, it perpetuates. These models acquire the same conservative tendencies, achieving high precision on majority fallback classes yet failing on the minority classes crucial for downstream tasks such as content moderation and hate speech detection. MultiSoc-4D is released as a diagnostic tool that explicitly measures, rather than conceals, annotation bias. While the human reference data quantifies the gap, it is not a final correction.

## References

## Appendix A Annotation Guidelines

### Instruction

This section details the instructions provided to annotators for the Bangla Comment Dataset to ensure consistency and high\-quality labeling\.

#### Annotation Schema

Each comment is evaluated across four distinct dimensions:

- **Category** (8 classes): International, National, Entertainment, Education, Sports, Technology, Economy, Other.
- **Sentiment** (3 classes): Positive, Negative, Neutral.
- **Hatefulness** (binary): Yes, No.
- **Sarcasm** (binary): Yes, No.

#### Standardized Examples

The following table provides representative examples for various annotation dimensions\.

#### Full Annotation Samples

Annotators are required to populate the fields in the sequence: Category → Sentiment → Hateful → Sarcasm.
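For concreteness, a minimal sketch of one annotation record following the schema and field order above; the class and field names are illustrative, and the released CSV uses its own column names:

```python
from dataclasses import dataclass

CATEGORIES = ("International", "National", "Entertainment", "Education",
              "Sports", "Technology", "Economy", "Other")

@dataclass
class AnnotationRecord:
    # Fields follow the required annotation sequence:
    # Category -> Sentiment -> Hateful -> Sarcasm.
    comment: str
    category: str   # one of CATEGORIES
    sentiment: str  # "Positive" | "Negative" | "Neutral"
    hateful: str    # "Yes" | "No"
    sarcasm: str    # "Yes" | "No"

    def __post_init__(self) -> None:
        # Enforce the closed-set schema at construction time.
        assert self.category in CATEGORIES
        assert self.sentiment in ("Positive", "Negative", "Neutral")
        assert self.hateful in ("Yes", "No")
        assert self.sarcasm in ("Yes", "No")
```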

#### Annotation Rules & Conflict Resolution

1. **Default Assignment (Confusion Rule):** In cases of high ambiguity or lack of context, annotators must default to: Other, Neutral, No, No (see the sketch after this list).
2. **Linguistic Nuance:** Criticism of an idea is labeled as Hateful: No, whereas attacks on identity or personhood are labeled as Hateful: Yes.
3. **Sarcasm Identification:** Sarcasm is labeled Yes only if the intended meaning is the opposite of the literal text (irony).
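Rule 1 is the mechanism that, when applied too eagerly by LLMs, produces the instruction-induced label collapse diagnosed in the main text. A minimal sketch of the Confusion Rule as code, with illustrative names:

```python
# Fallback labels prescribed by the Confusion Rule, per dimension.
FALLBACK = {"category": "Other", "sentiment": "Neutral",
            "hateful": "No", "sarcasm": "No"}

def apply_confusion_rule(partial: dict) -> dict:
    """Fill any unresolved (None or missing) dimension with its fallback."""
    return {dim: partial.get(dim) or default
            for dim, default in FALLBACK.items()}

print(apply_confusion_rule({"category": "Sports", "sentiment": None}))
# -> {'category': 'Sports', 'sentiment': 'Neutral',
#     'hateful': 'No', 'sarcasm': 'No'}
```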

### Annotation Prompt

**LLM Annotation Prompt**

**Role:** Act as a data annotator specializing in Bengali text classification.

**Input:** You are provided with:

- A CSV file containing user comments.
- An instruction document (Instruction.docx) defining labeling criteria.

**Task:**

- Parse the text in the `comments` column.
- Annotate each entry following the hierarchy and definitions in Instruction.docx.
- Classify data into the standardized labels defined in the instruction file.
- Prioritize neutral labeling where appropriate.

**Output Requirements:**

- Return a CSV file containing only:
  - Original comments
  - Generated labels
- Ensure no extra commentary or metadata is included.
- Assign your name in the `Annotator Name` field.

**Final Instruction:** Provide the output as a downloadable CSV file only.
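For reference, a minimal sketch of how such a prompt could be issued programmatically through a chat-completions API; the model name, batching, and prompt wiring below are assumptions, and the paper's pipeline used the providers' chat interfaces with file uploads rather than code like this:

```python
from openai import OpenAI

# A condensed, hypothetical version of the annotation prompt above.
ANNOTATION_PROMPT = """Act as a data annotator specializing in Bengali text
classification. Annotate each comment following Instruction.docx across
Category, Sentiment, Hateful, Sarcasm. Prioritize neutral labeling where
appropriate. Return CSV rows only: comment,category,sentiment,hateful,sarcasm."""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_batch(comments: list[str], model: str = "gpt-4o") -> str:
    """Return the model's CSV-formatted annotations for a batch of comments."""
    batch = "\n".join(comments)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": ANNOTATION_PROMPT},
                  {"role": "user", "content": batch}],
    )
    return response.choices[0].message.content
```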

## Appendix B: Dataset Examples

The dataset is diverse, but the annotations appear noisy. A sample of the dataset is shown in Table [10](https://arxiv.org/html/2605.06940#A2.T10).

Table 10: Sample Annotated Dataset
## Appendix C: Additional Dataset Visualization

The text-length visualization in Figure [11](https://arxiv.org/html/2605.06940#A3.F11) shows the range of the text present in the dataset's "Comment" field.

![Refer to caption](https://arxiv.org/html/2605.06940v1/text-length.jpg)
Figure 11: The dataset contains both short and long text samples, ranging from brief comments to longer conversational statements. This variation reflects realistic usage patterns and introduces additional complexity for annotation.
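A sketch of how a length distribution like Figure 11 could be reproduced, assuming the released CSV exposes a "Comment" column; the file name is illustrative:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; substitute the path to the released CSV.
df = pd.read_csv("multisoc4d.csv")
lengths = df["Comment"].astype(str).str.len()

# Histogram of comment lengths in characters.
plt.hist(lengths, bins=60)
plt.xlabel("Comment length (characters)")
plt.ylabel("Number of comments")
plt.title("Distribution of comment lengths")
plt.savefig("text-length.png", dpi=150)
```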
## Appendix D: Extended Results Tables

The benchmarking covers four types of model pipelines:

1. Instruction-tuned large language models
2. Multi-lingual transformers
3. Mono-lingual transformers
4. Traditional machine learning models

The performance benchmark of the instruction-tuned LLMs is shown in Table [11](https://arxiv.org/html/2605.06940#A4.T11). The performance of transformer-based mono- and multi-lingual models is presented in Table [13](https://arxiv.org/html/2605.06940#A4.T13), and Table [12](https://arxiv.org/html/2605.06940#A4.T12) shows the benchmarking of traditional machine learning models.
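The tables report accuracy and macro F1 per annotation dimension. A minimal sketch of these metrics using `scikit-learn`, with toy labels chosen to show why macro F1 exposes the fallback collapse that raw accuracy hides:

```python
from sklearn.metrics import accuracy_score, f1_score

def dimension_metrics(y_true: list[str], y_pred: list[str]) -> dict:
    """Accuracy and macro F1 for one annotation dimension."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Macro F1 weights minority classes (e.g., Hateful: Yes) equally
        # with the majority fallback class, so collapse drags it down.
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

# Toy example: predictions collapsed entirely to the fallback "No".
y_true = ["Yes", "No", "No", "Yes", "No", "No"]
y_pred = ["No",  "No", "No", "No",  "No", "No"]
print(dimension_metrics(y_true, y_pred))
# accuracy looks decent (~0.67) while macro F1 is poor (~0.4)
```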

Table 11: Performance Comparison of Instruction-Tuned Models Across Four Dimensions
Table 12: Performance Comparison of Traditional Machine Learning Models Across Four Dimensions
Table 13: Performance Comparison of MLM Models Across Four Dimensions
## Appendix E: Results Visualization

![Refer to caption](https://arxiv.org/html/2605.06940v1/llmsen.jpg)
Figure 12: Comparative radar charts illustrating instruction-tuned models' sensitivity across accuracy and macro F1 dimensions.
![Refer to caption](https://arxiv.org/html/2605.06940v1/mlmsen.jpg)
Figure 13: Comparative radar charts illustrating multi/monolingual models' sensitivity across accuracy and macro F1 dimensions.
![Refer to caption](https://arxiv.org/html/2605.06940v1/mlsen.jpg)
Figure 14: Comparative radar charts illustrating traditional machine learning models' sensitivity across accuracy and macro F1 dimensions.
