YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling

arXiv cs.CL Papers

Summary

This paper details the YEZE system for SemEval-2026 Task 9, which detects online polarization in 22 languages using a heterogeneous ensemble of XLM-RoBERTa and mDeBERTa models.

arXiv:2605.06231v1 Announce Type: new Abstract: This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, which identifies polarized social media content in 22 languages through three subtasks: binary detection, target classification, and manifestation identification. We propose a heterogeneous ensemble of multilingual pretrained models, combining XLM-RoBERTa-large and mDeBERTa-v3-base. We investigate techniques such as multi-task learning, translation-based data augmentation, and class weighting to improve classification performance under severe label imbalance. Our findings indicate that independent task modeling combined with class weighting is more effective.

# YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling
Source: [https://arxiv.org/html/2605.06231](https://arxiv.org/html/2605.06231)
###### Abstract

This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, which identifies polarized social media content in 22 languages through three subtasks: binary detection, target classification, and manifestation identification. We propose a heterogeneous ensemble of multilingual pretrained models, combining XLM-RoBERTa-large and mDeBERTa-v3-base. We investigate techniques such as multi-task learning, translation-based data augmentation, and class weighting to improve classification performance under severe label imbalance. Our findings indicate that independent task modeling combined with class weighting is more effective.


Fengze Guo, Yue Chang
University of Tübingen
{fengze.guo, yue.chang}@student.uni-tuebingen.de

## 1 Introduction

Social media platforms have expanded public communication worldwide, but they have also amplified online polarization across languages and cultures Howard ([2019](https://arxiv.org/html/2605.06231#bib.bib13)). Polarized content differs not only in targets and stance but also in its linguistic manifestations, making automatic identification important for large-scale analysis of social dynamics and for assisting downstream moderation and policy research.

This paper presents our system for SemEval-2026 Task 9 Naseem et al. ([2026a](https://arxiv.org/html/2605.06231#bib.bib18)), a shared task dedicated to identifying online polarization in multilingual social media. The code is available at [https://github.com/FezeGo/SemEval-2026-Task9-Polar](https://github.com/FezeGo/SemEval-2026-Task9-Polar). The shared task comprises three subtasks: (1) binary classification to determine whether a post is polarized, (2) multi-label classification of the polarization target type, and (3) multi-label identification of manifestation categories. The official dataset Naseem et al. ([2026b](https://arxiv.org/html/2605.06231#bib.bib19)) consists of multilingual social media posts in 22 languages: Amharic, Arabic, Bengali, Burmese, Chinese, English, German, Hausa, Hindi, Italian, Khmer, Nepali, Odia, Persian, Polish, Punjabi, Russian, Spanish, Swahili, Telugu, Turkish, and Urdu. Subtasks 1 and 2 cover all 22 languages, while Subtask 3 excludes Italian, Polish, Russian, and Burmese. Detailed input/output schemas and language coverage are provided in Appendices [A](https://arxiv.org/html/2605.06231#A1) and [B](https://arxiv.org/html/2605.06231#A2). All subtasks are evaluated using the macro-averaged F1 score (Macro-F1).

Our system covers all subtasks, modeling each as an independent problem. Our final system combines two complementary multilingual encoders (XLM-RoBERTa-large Conneau et al. ([2020](https://arxiv.org/html/2605.06231#bib.bib5)) and mDeBERTa-v3-base He et al. ([2023](https://arxiv.org/html/2605.06231#bib.bib12))) in a weighted ensemble optimized on the development set. For the multi-label subtasks, we use binary relevance with weighted binary cross-entropy to mitigate severe label imbalance.

This task poses several challenges for multilingual modeling\. We observe substantial cross\-lingual variation in label distributions, with strong prior shift in Subtask 1 and pronounced sparsity in Subtasks 2 and 3, making minority\-label learning and calibration central under Macro\-F1\. Our experiments also suggest that shared multi\-task training can introduce negative transfer when sparse fine\-grained labels compete with the dominant binary objective, making fine\-grained prediction less stable than coarse binary detection\. Together, these observations indicate that fine\-grained label sparsity, cross\-task inconsistency, and negative transfer remain the main bottlenecks, especially for manifestation prediction\.

In the official evaluation, we ranked in the Top 10 for 11/22, 16/22, and 17/18 languages in Subtasks 1, 2, and 3, respectively. The detailed per-language rankings are presented in Appendix [C](https://arxiv.org/html/2605.06231#A3).

## 2 Background

Online polarization is an intricate socio-technical phenomenon. Beyond algorithmic amplification within “information bubbles” Build Up ([2025](https://arxiv.org/html/2605.06231#bib.bib3)), social network structures often foster echo chambers where bipolar discourse is costly and infrequent Garimella et al. ([2018](https://arxiv.org/html/2605.06231#bib.bib10)). Crucially, research suggests that mere exposure to opposing viewpoints can paradoxically increase polarization Bail et al. ([2018](https://arxiv.org/html/2605.06231#bib.bib2)), underscoring the deep-seated nature of digital rifts. These divisions have severe offline consequences, such as intensifying ethnic mobilization and marginalizing vulnerable voices in conflict zones Ali et al. ([2025](https://arxiv.org/html/2605.06231#bib.bib1)). In NLP, early work largely focused on identifying predictive features for hate speech and offensive content Waseem and Hovy ([2016](https://arxiv.org/html/2605.06231#bib.bib26)); Davidson et al. ([2017](https://arxiv.org/html/2605.06231#bib.bib6)). Recent frameworks like HateCheck Röttger et al. ([2021](https://arxiv.org/html/2605.06231#bib.bib22)) have introduced functional testing to evaluate model robustness against complex linguistic phenomena such as negation and counter-speech.

Recent benchmarks Naseem et al. ([2026b](https://arxiv.org/html/2605.06231#bib.bib19)) suggest that while binary detection is relatively mature, fine-grained target and manifestation prediction remains substantially more challenging. In addition, LLM-based approaches have been shown to be less reliable than specialized supervised models for these fine-grained labels. Motivated by these limitations, we focus on supervised multilingual encoder-based models, which provide more stable and controllable optimization in low-resource and imbalanced settings.

Such models are typically based on Transformer architectures Vaswani et al. ([2017](https://arxiv.org/html/2605.06231#bib.bib25)), with pre-trained variants such as BERT Devlin et al. ([2019](https://arxiv.org/html/2605.06231#bib.bib8)) offering strong contextual representations for downstream classification tasks.

Within this paradigm, a common strategy is to jointly model related objectives using multi-task learning (MTL) Caruana ([1997](https://arxiv.org/html/2605.06231#bib.bib4)). However, prior work has shown that MTL can suffer from negative transfer and task interference, particularly when label distributions are highly imbalanced or heterogeneous Yu et al. ([2020](https://arxiv.org/html/2605.06231#bib.bib28)). In contrast, we adopt independent task modeling combined with ensemble learning, which proves to be more robust under severe label sparsity.

## 3 System Overview

Our system targets multilingual polarization modeling across 22 languages under severe label imbalance. We follow three core design principles: (i) per-subtask specialization to ensure high accuracy on each individual task, (ii) imbalance-aware optimization to tackle class imbalance in multi-label subtasks, and (iii) heterogeneous ensembling to enhance robustness across languages and events.

### 3.1 Model Architecture

We treat the three subtasks as independent classification problems, with separate models trained for each. Our architecture leverages two complementary multilingual encoders to maximize performance across diverse languages and subtasks:

#### XLM-RoBERTa-large.

We select XLM-RoBERTa-large (XLM-R) as the primary backbone because it is a strong multilingual encoder with broad cross-lingual coverage and stable transfer performance in prior work Conneau et al. ([2020](https://arxiv.org/html/2605.06231#bib.bib5)).

#### mDeBERTa-v3-base.

We also employ mDeBERTa-v3-base (mDeBERTa), which uses disentangled attention to encode content and relative position separately He et al. ([2023](https://arxiv.org/html/2605.06231#bib.bib12)).

Together, these two encoders provide complementary representation spaces and tokenization behaviors, which are then combined through ensembling to increase robustness across different languages and tasks.

### 3.2 Imbalance-aware Optimization

As shown in Figure [1](https://arxiv.org/html/2605.06231#S3.F1), the polarized rate varies substantially across languages, indicating strong cross-lingual prior shift. This motivates the use of imbalance-aware optimization, especially for the multi-label subtasks where sparsity is even more severe.

![Refer to caption](https://arxiv.org/html/2605.06231v1/x1.png)

Figure 1: Positive rates for Subtask 1 across languages in the merged train+dev set.

For Subtasks 2 and 3, we adopt a Binary Relevance (BR) formulation Read et al. ([2009](https://arxiv.org/html/2605.06231#bib.bib21)), where multi-label predictions are decomposed into independent binary classification tasks. This allows each label to be treated as a separate problem, with a sigmoid activation applied to each logit to yield independent probabilities:

$$\hat{y}_{c} = P(c=1 \mid x) = \sigma(z_{c}) \in [0,1].$$

To handle severe class imbalance, we utilize Weighted Binary Cross-Entropy (WBCE) loss, a common approach in cost-sensitive and imbalanced learning Elkan ([2001](https://arxiv.org/html/2605.06231#bib.bib9)); He and Garcia ([2009](https://arxiv.org/html/2605.06231#bib.bib11)), where the weight for each class is inversely proportional to the frequency of its occurrences in the training data:

$$\mathcal{L}(y_{c}, \hat{y}_{c}) = -\left[ w_{c} \cdot y_{c} \log(\hat{y}_{c}) + (1 - y_{c}) \log(1 - \hat{y}_{c}) \right],$$

where the weight $w_{c}$ is computed as:

$$w_{c} = \frac{N_{\text{neg},c}}{\max(N_{\text{pos},c}, 1)}.$$

Here, $N_{\text{neg},c}$ and $N_{\text{pos},c}$ represent the number of negative and positive instances for each label, respectively.
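To make the weighting concrete, the following is a minimal pure-Python sketch of the weight formula and the per-example WBCE loss as defined above. The function names and the toy label matrix are illustrative assumptions, not the released training code; in PyTorch the same effect is commonly obtained through the `pos_weight` argument of `BCEWithLogitsLoss`, although the text does not state which call was used.

```python
import math


def positive_weights(label_matrix):
    """Per-label weight w_c = N_neg,c / max(N_pos,c, 1)."""
    n_rows = len(label_matrix)
    n_labels = len(label_matrix[0])
    weights = []
    for c in range(n_labels):
        n_pos = sum(row[c] for row in label_matrix)
        n_neg = n_rows - n_pos
        weights.append(n_neg / max(n_pos, 1))
    return weights


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_bce(logits, targets, weights, eps=1e-12):
    """Mean weighted BCE over the labels of one example.

    Only the positive term is scaled by w_c, as in the equation above.
    Averaging over labels is an assumption of this sketch.
    """
    total = 0.0
    for z, y, w in zip(logits, targets, weights):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp for numerical safety
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


# Toy data (hypothetical): 4 posts, 2 labels; the second label is rarer.
toy_labels = [[1, 0], [1, 0], [0, 0], [0, 1]]
toy_weights = positive_weights(toy_labels)
toy_loss = weighted_bce([0.0, 0.0], [0, 1], toy_weights)
```

With these toy counts the rarer label receives a weight three times larger than the balanced one, so a missed positive on it contributes three times as much to the loss.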

We also evaluated the effectiveness of WBCE relative to other loss functions; a detailed comparison is provided in Section [5.3](https://arxiv.org/html/2605.06231#S5.SS3).

### 3.3 Heterogeneous Ensemble Strategy

For our final submission, we ensemble the two backbones using weighted probability averaging. For each label $c$ and input post $x$, the ensembled posterior is:

$$\bar{P}(c=1 \mid x) = \alpha \, P_{\text{XLM-R}}(c=1 \mid x) + (1-\alpha) \, P_{\text{mDeBERTa}}(c=1 \mid x),$$

where $\alpha$ is selected on the dev set to maximize Macro-F1 (we use $\alpha = 0.7$).
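The weighted averaging can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the probabilities are made up, and the choice of `>=` at the decision threshold is an assumption because the text does not say how ties are handled.

```python
def ensemble_probs(p_xlmr, p_mdeberta, alpha=0.7):
    """Weighted average of per-label probabilities from the two backbones."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_xlmr, p_mdeberta)]


def binarize(probs, tau=0.5):
    """Apply a global threshold tau to every label (tie handling is assumed)."""
    return [1 if p >= tau else 0 for p in probs]


# Hypothetical per-label probabilities for one post with two labels.
p_bar = ensemble_probs([0.9, 0.2], [0.1, 0.6])
labels = binarize(p_bar)
```

With $\alpha = 0.7$ the XLM-R probability dominates, so the first label stays positive even though mDeBERTa alone would reject it.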

### 3.4 Pipeline Summary

Our training and inference pipeline is:

1. fine-tune XLM-R and mDeBERTa separately for each subtask,
2. tune the ensemble weight $\alpha$ on the dev set,
3. generate test predictions by weighted ensembling and a global threshold $\tau = 0.5$.
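The second step of the pipeline can be sketched as a one-dimensional grid search. The grid spacing of 0.1, the helper names, and the toy dev-set probabilities below are assumptions for illustration; the example scores a binary task with Macro-F1 averaged over the two classes.

```python
def class_f1(y_true, y_pred, positive):
    """F1 for one class; returns 0.0 when the class never occurs or is never predicted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binary_macro_f1(y_true, y_pred):
    return (class_f1(y_true, y_pred, 0) + class_f1(y_true, y_pred, 1)) / 2


def tune_alpha(p_a, p_b, y_true, grid, tau=0.5):
    """Return the first alpha on the grid with the highest dev Macro-F1."""
    best_alpha, best_score = None, -1.0
    for alpha in grid:
        preds = [1 if alpha * a + (1 - alpha) * b >= tau else 0
                 for a, b in zip(p_a, p_b)]
        score = binary_macro_f1(y_true, preds)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


# Hypothetical dev-set probabilities from two models and gold labels.
dev_a = [0.9, 0.8, 0.3, 0.2]
dev_b = [0.2, 0.6, 0.75, 0.4]
dev_y = [1, 1, 0, 0]
grid = [i / 10 for i in range(11)]
best_alpha, best_score = tune_alpha(dev_a, dev_b, dev_y, grid)
```

Because the search keeps the first best value, ties between several equally good weights resolve to the smallest one; a different tie-breaking rule is equally defensible.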

Experimental details are provided in Section [4](https://arxiv.org/html/2605.06231#S4).

### 3.5 System Variants

We evaluate:

1. single-backbone models (XLM-R or mDeBERTa),
2. an MTL variant with a shared XLM-R encoder and task-specific heads,
3. the heterogeneous ensemble with independent per-subtask training (submitted).

Table 1: Official Macro-F1 results by language for Subtasks 1–3. Ensem is our final submission; mDeB denotes mDeBERTa-v3-base. Blue indicates that the ensemble is tied for or achieves the best score, while orange indicates that the best non-ensemble model achieves a strictly higher score than the ensemble.

## 4 Experimental Setup

### 4.1 Data and Partitioning

We restrict training to the provided task data and do not use external lexicons. The official dataset Naseem et al. ([2026b](https://arxiv.org/html/2605.06231#bib.bib19)) provides train, dev, and test partitions.

#### Development Phase\.

We used an internal 85/15 split of the official train set for model selection. For Subtask 1, standard stratification was applied; for Subtasks 2 and 3, we used iterative stratification Sechidis et al. ([2011](https://arxiv.org/html/2605.06231#bib.bib24)) to preserve rare label co-occurrences under extreme sparsity.
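For the binary case, a stratified 85/15 split can be sketched as below. This is a simplified illustration using only the standard library; the function name, seed, and toy labels are assumptions, and the iterative multi-label variant used for Subtasks 2 and 3 is not reproduced here.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.15, seed=13):
    """Split indices so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_fraction)
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels (hypothetical): 20 polarized and 80 non-polarized posts.
labels = [1] * 20 + [0] * 80
train_idx, val_idx = stratified_split(labels)
```

Both parts keep the 20% positive rate of the toy data, which is the property that plain random splitting does not guarantee on small or skewed languages.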

#### Test Phase\.

Upon the release of dev labels, we utilized the official dev set as a hold-out validation set for final hyperparameter tuning. The final system was retrained on the union of the official train and dev sets to generate predictions for the hidden test set.

### 4.2 Evaluation Metrics

All subtasks are evaluated using Macro-F1:

$$\text{Macro-F1} = \frac{1}{|L|} \sum_{l \in L} F1_{l}.$$

For Subtasks 2 and 3, $F1_{l}$ is computed independently for each label and then averaged (label-wise macro), which emphasizes rare categories.
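The label-wise averaging can be illustrated with a short sketch. The helper names and the toy predictions are assumptions, and scoring a label with no true or predicted positives as 0 is a convention chosen here; the official scorer may treat that case differently.

```python
def label_f1(y_true, y_pred):
    """Binary F1 for a single label column; 0.0 if there is nothing to count."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(rows_true, rows_pred):
    """Average the per-label F1 scores with equal weight per label."""
    n_labels = len(rows_true[0])
    scores = []
    for c in range(n_labels):
        col_true = [row[c] for row in rows_true]
        col_pred = [row[c] for row in rows_pred]
        scores.append(label_f1(col_true, col_pred))
    return sum(scores) / n_labels


# Toy example (hypothetical): the frequent label is perfect,
# the rare label's single positive is missed.
gold = [[1, 0], [1, 0], [1, 1], [0, 0]]
pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
score = macro_f1(gold, pred)
```

A single missed positive on the rare label halves the score here, which is why minority-label recall matters so much under this metric.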

### 4.3 Implementation

#### Preprocessing.

We applied minimal preprocessing: rows with missing or empty text were removed, while emojis were retained to preserve affective cues. No task-specific normalization beyond shared-task standardization (e.g., URLs as [URL]) was performed. Tokenization used the default tokenizer with truncation and dynamic padding, and a maximum sequence length of 256.
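The row filter described above amounts to very little code. The sketch below assumes each example is a dictionary with a `text` field, which is a hypothetical schema rather than the confirmed column name of the released files.

```python
def drop_empty_rows(rows, text_key="text"):
    """Keep rows whose text is a non-empty string; leave the text itself untouched."""
    return [
        row for row in rows
        if isinstance(row.get(text_key), str) and row[text_key].strip()
    ]


# Toy rows (hypothetical); emojis and the [URL] placeholder pass through unchanged.
raw_rows = [
    {"id": 1, "text": "They never listen 😡 [URL]"},
    {"id": 2, "text": "   "},
    {"id": 3, "text": None},
    {"id": 4},
]
clean_rows = drop_empty_rows(raw_rows)
```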

#### Training setup\.

Experiments were conducted on NVIDIA A100 GPUs using PyTorch and Hugging Face Transformers (bf16/tf32). Models were fine-tuned with AdamW and a linear scheduler (warmup ratio 0.1) with weight decay set to 0. On the internal validation set, we use early stopping with patience 2. The learning rate was $1 \times 10^{-5}$ for XLM-R (epochs 4, batch size 32) and $2 \times 10^{-5}$ for mDeBERTa (epochs 5, batch size 64). Additional implementation details are given in Appendix [D](https://arxiv.org/html/2605.06231#A4).
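The shape of a linear schedule with a 0.1 warmup ratio can be sketched as a multiplier on the base learning rate. This is a plain-Python illustration of the usual warmup-then-linear-decay curve; the function name, the rounding of the warmup length, and the total step count are assumptions, not values taken from the training logs.

```python
def linear_warmup_decay(step, total_steps, warmup_ratio=0.1):
    """Multiplier in [0, 1]: ramps up linearly, then decays linearly to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Hypothetical run of 1000 optimizer steps with the XLM-R base rate of 1e-5.
base_lr = 1e-5
lr_at_peak = base_lr * linear_warmup_decay(100, 1000)
lr_midway = base_lr * linear_warmup_decay(550, 1000)
```

In this toy run the rate peaks at the base value after 100 steps and has fallen to half of it by step 550.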

#### Thresholds and ensembling\.

For multi-label settings, weighted binary cross-entropy was used with a positive weight computed from label frequencies. Per-label threshold tuning was not adopted due to performance degradation, so a global threshold $\tau = 0.5$ was used. The ensemble coefficient $\alpha$ was set to 0.7 after grid search.

## 5 Results and Discussion

![Refer to caption](https://arxiv.org/html/2605.06231v1/x2.png)

Figure 2: Dataset imbalance analysis on the merged train+dev set. (a) Task-level skew (Language × Task). (b) Subtask 3 label imbalance (Language × Label).

### 5.1 Official Evaluation Results

Table [1](https://arxiv.org/html/2605.06231#S3.SS5) reports official test Macro-F1 by language for all subtasks. We use the ensemble as the final submission across all subtasks to maintain a single consistent pipeline. In Subtask 2, it reaches the Top 5 in Amharic (4th), Urdu (5th), Odia (5th), and Polish (5th). In Subtask 3, it ranks in the Top 5 for Amharic (4th), Arabic (3rd), English (5th), Khmer (5th), Spanish (4th), and Urdu (3rd).

### 5.2 Data Imbalance and Cross-lingual Conditions

Figure [2](https://arxiv.org/html/2605.06231#S5.F2) summarizes train+dev label distributions across subtasks and languages. Figure [2](https://arxiv.org/html/2605.06231#S5.F2)(a) shows strong cross-lingual prior shift in Subtask 1 (the per-language polarized rate has a minimum of 10.75%, a median of 49.45%, and a maximum of 90.79%) and overall skew across subtasks. As shown in Figure [2](https://arxiv.org/html/2605.06231#S5.F2)(b), Vilification and Stereotype are more frequent than the remaining manifestation labels. Overall, difficulty stems from sparsity interacting with cross-lingual heterogeneity Hu et al. ([2020](https://arxiv.org/html/2605.06231#bib.bib14)). Despite partial intra-family consistency Pires et al. ([2019](https://arxiv.org/html/2605.06231#bib.bib20)); Wu and Dredze ([2019](https://arxiv.org/html/2605.06231#bib.bib27)), extreme sparsity Lauscher et al. ([2020](https://arxiv.org/html/2605.06231#bib.bib16)) and script/tokenization effects Rust et al. ([2021](https://arxiv.org/html/2605.06231#bib.bib23)) still limit fine-grained transfer.
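The prior-shift summary is a per-language positive rate followed by a minimum, median, and maximum over languages. The sketch below shows that computation on made-up data; the language names and counts are placeholders and do not reproduce the statistics reported above.

```python
from collections import defaultdict
from statistics import median


def positive_rates(rows):
    """Map each language to the fraction of its posts labeled polarized."""
    counts = defaultdict(lambda: [0, 0])  # language -> [positives, total]
    for lang, label in rows:
        counts[lang][0] += label
        counts[lang][1] += 1
    return {lang: pos / total for lang, (pos, total) in counts.items()}


# Toy (language, Subtask 1 label) pairs; purely hypothetical.
toy_rows = (
    [("lang_a", 1)] * 1 + [("lang_a", 0)] * 3
    + [("lang_b", 1)] * 2 + [("lang_b", 0)] * 2
    + [("lang_c", 1)] * 3 + [("lang_c", 0)] * 1
)
rates = positive_rates(toy_rows)
summary = (min(rates.values()), median(rates.values()), max(rates.values()))
```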

### 5.3 Optimization Analysis

#### Loss functions.

Development tests (Table [2](https://arxiv.org/html/2605.06231#S5.T2)) show Focal Loss Lin et al. ([2018](https://arxiv.org/html/2605.06231#bib.bib17)) benefits well-represented languages but is unstable under sparsity. Conversely, WBCE significantly boosts low-resource settings (e.g., yielding a +0.212 increase in Macro-F1 for Telugu) and improves the 22-language average by 3.5 points (0.492 → 0.527).
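For readers comparing the objectives, the sketch below contrasts plain BCE with the binary focal loss on a single probability. The focusing parameter `gamma=2.0` is the commonly cited default from the focal loss paper and is an assumption here, since the ablation settings are not listed in this section; the class-balancing factor of the original formulation is omitted for brevity.

```python
import math


def bce(p, y, eps=1e-12):
    """Unweighted binary cross-entropy for one predicted probability."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss: BCE scaled by (1 - p_t) ** gamma."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy positive (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
easy_ratio = focal_loss(0.9, 1) / bce(0.9, 1)
hard_ratio = focal_loss(0.1, 1) / bce(0.1, 1)
```

Focal loss rescales examples by difficulty, whereas WBCE rescales by class frequency, so the two respond differently when a label has only a handful of positives.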

Table 2: Optimization objective ablation for Subtask 2 (Macro-F1). Base = BCE loss; Focal = Focal Loss; Δ = the improvement of WBCE over Base.

#### MTL vs. Independent Modeling.

The multi-task learning (MTL) variant suggests negative transfer for the sparsest labels. While shared representations can slightly benefit Subtask 1, Subtasks 2 and 3 suffer when low-frequency labels compete with the dominant binary objective during shared updates Yu et al. ([2020](https://arxiv.org/html/2605.06231#bib.bib28)). We use independent per-subtask modeling to preserve fine-grained signals.

### 5.4 Error Analysis and Post-hoc Evaluation

Post-hoc evaluation confirms strong cross-lingual variability; additional figures are shown in Appendix [E](https://arxiv.org/html/2605.06231#A5). Weak transfer from Subtask 1 to Subtasks 2 and 3 (e.g., Bengali and Telugu) highlights fine-grained label sparsity. Category-wise, Political (Subtask 2), Vilification and Stereotype (Subtask 3) are the most reliable, while Gender/Sexual, Religious, and Other (Subtask 2), as well as Dehumanization, Lack of Empathy, and Invalidation (Subtask 3), remain challenging.

#### Miscalibration and label collapse.

Low-resource settings exhibit miscalibration (high recall but low precision) and label collapse, where sparse categories fall to near-zero F1, suggesting expanded but noisy decision regions Desai and Durrett ([2020](https://arxiv.org/html/2605.06231#bib.bib7)).

#### Cross\-task inconsistency\.

Independent modeling causes hierarchical violations: some instances labeled non\-polarized in Subtask 1 receive positive Subtask 2 or 3 labels\. This suggests a trade\-off between robustness and coherence, motivating future post\-hoc gating or hierarchical calibration\.
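One simple form of post-hoc gating can be sketched as follows. This is a hypothetical illustration of the future-work idea, not part of the submitted system: fine-grained labels are cleared whenever the Subtask 1 prediction is non-polarized, and the helper names are invented for the example.

```python
def count_violations(binary_preds, fine_preds):
    """Count posts predicted non-polarized that still carry a fine-grained label."""
    return sum(
        1 for b, labels in zip(binary_preds, fine_preds)
        if b == 0 and any(labels)
    )


def gate(binary_preds, fine_preds):
    """Zero out fine-grained labels for posts predicted non-polarized."""
    return [
        list(labels) if b == 1 else [0] * len(labels)
        for b, labels in zip(binary_preds, fine_preds)
    ]


# Toy predictions (hypothetical): the second post violates the hierarchy.
subtask1 = [1, 0, 0]
subtask2 = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
before = count_violations(subtask1, subtask2)
gated = gate(subtask1, subtask2)
after = count_violations(subtask1, gated)
```

Hard gating restores coherence but also propagates every Subtask 1 false negative into the fine-grained outputs, which is the robustness cost noted above.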

#### Translation augmentation\.

Machine translation augmentation (using 4,000 Gemini-generated samples for Hausa) failed to yield significant performance gains. This result suggests that synthetic “translationese” and the resulting pragmatic shifts Ji ([2023](https://arxiv.org/html/2605.06231#bib.bib15)) may obscure region-specific idioms and subtle linguistic cues.

## 6 Conclusion

We presented a multilingual system for SemEval\-2026 Task 9 that addresses polarization detection and characterization across 22 languages\. Our final system combines XLM\-R and mDeBERTa in a heterogeneous ensemble and uses imbalance\-aware optimization to improve robustness\. Across official evaluations, the system remains competitive and shows that independent per\-subtask modeling is a strong practical alternative to shared MTL\.

Our analyses show that the main challenges are not only cross\-lingual variation, but also fine\-grained label sparsity, calibration instability, and cross\-task inconsistency\. Post\-hoc evaluation on the released test gold labels further confirms that high\-coverage labels are substantially easier to learn, while sparse manifestation labels remain the most fragile\. In future work, we plan to explore lightweight hierarchical calibration \(e\.g\., gating\-based consistency constraints\) and culturally grounded data synthesis to better capture region\-specific socio\-political idioms for fine\-grained polarization analysis\.

## Acknowledgments

We would like to express our sincere gratitude to Dr\. Çağrı Çöltekin for his invaluable guidance, insightful comments, and continuous support throughout this study\.

## References

- Ali et al\. \(2025\)Adem Chanie Ali, Seid Muhie Yimam, Abinew Ali Ayele, Chris Biemann, and Martin Semmann\. 2025\.[Silenced voices: social media polarization and women’s marginalization in peacebuilding during the Northern Ethiopia War](https://doi.org/doi:10.1515/icom-2025-0007)\.*i\-com*, 24\(2\):407–432\.
- Bail et al\. \(2018\)Christopher A\. Bail, Lisa P\. Argyle, Taylor W\. Brown, John P\. Bumpus, Haohan Chen, M\. B\. Fallin Hunzaker, Jaemin Lee, Marcus Mann, Friedolin Merhout, and Alexander Volfovsky\. 2018\.[Exposure to opposing views on social media can increase political polarization](https://doi.org/10.1073/pnas.1804840115)\.*Proceedings of the National Academy of Sciences*, 115\(37\):9216–9221\.
- Build Up \(2025\)Build Up\. 2025\.[Polarization footprint europe report](https://howtobuildup.org/wp-content/uploads/2025/11/Polarization-footrpint-Europe-report-.pdf)\.Technical report, Build Up\.
- Caruana \(1997\)Rich Caruana\. 1997\.[Multitask learning](https://doi.org/10.1023/A:1007379606734)\.*Machine Learning*, 28\(1\):41–75\.
- Conneau et al\. \(2020\)Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov\. 2020\.[Unsupervised cross\-lingual representation learning at scale](https://doi.org/10.18653/v1/2020.acl-main.747)\.In*Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 8440–8451, Online\. Association for Computational Linguistics\.
- Davidson et al\. \(2017\)Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber\. 2017\.[Automated hate speech detection and the problem of offensive language](https://doi.org/10.1609/icwsm.v11i1.14955)\.In*Proceedings of the Eleventh International AAAI Conference on Web and Social Media \(ICWSM\)*, pages 512–515\.
- Desai and Durrett \(2020\)Shrey Desai and Greg Durrett\. 2020\.[Calibration of pre\-trained transformers](https://doi.org/10.18653/v1/2020.emnlp-main.21)\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 295–302, Online\. Association for Computational Linguistics\.
- Devlin et al\. \(2019\)Jacob Devlin, Ming\-Wei Chang, Kenton Lee, and Kristina Toutanova\. 2019\.[BERT: pre\-training of deep bidirectional transformers for language understanding](https://doi.org/10.18653/V1/N19-1423)\.In*Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL\-HLT 2019, Minneapolis, MN, USA, June 2\-7, 2019, Volume 1 \(Long and Short Papers\)*, pages 4171–4186\. Association for Computational Linguistics\.
- Elkan \(2001\)Charles Elkan\. 2001\.[The foundations of cost\-sensitive learning](https://doi.org/10.5555/1642194.1642224)\.In*Proceedings of the 17th International Joint Conference on Artificial Intelligence \- Volume 2*, IJCAI’01, page 973–978, San Francisco, CA, USA\. Morgan Kaufmann Publishers Inc\.
- Garimella et al\. \(2018\)Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis, and Michael Mathioudakis\. 2018\.[Political discourse on social media: Echo chambers, gatekeepers, and the price of bipartisanship](https://arxiv.org/abs/1801.01665)\.*Preprint*, arXiv:1801\.01665\.
- He and Garcia \(2009\)Haibo He and Edwardo A\. Garcia\. 2009\.[Learning from imbalanced data](https://doi.org/10.1109/TKDE.2008.239)\.*IEEE Transactions on Knowledge and Data Engineering*, 21\(9\):1263–1284\.
- He et al\. \(2023\)Pengcheng He, Jianfeng Gao, and Weizhu Chen\. 2023\.[DeBERTaV3: Improving DeBERTa using ELECTRA\-style pre\-training with gradient\-disentangled embedding sharing](https://arxiv.org/abs/2111.09543)\.*Preprint*, arXiv:2111\.09543\.
- Howard \(2019\)Jeffrey W\. Howard\. 2019\.[Free Speech and Hate Speech](https://doi.org/10.1146/annurev-polisci-051517-012343)\.*Annual Review of Political Science*, 22:93–109\.
- Hu et al\. \(2020\)Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson\. 2020\.[XTREME: A massively multilingual multi\-task benchmark for evaluating cross\-lingual generalization](https://arxiv.org/abs/2003.11080)\.*Preprint*, arXiv:2003\.11080\.
- Ji \(2023\)Meng Ji\. 2023\.[Cultural and linguistic bias of neural machine translation technology](https://doi.org/10.1017/9781108938976.005)\.In*Translation Technology in Accessible Health Communication*, pages 100–128\. Cambridge University Press\.
- Lauscher et al\. \(2020\)Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš\. 2020\.[From zero to hero: On the limitations of zero\-shot language transfer with multilingual Transformers](https://doi.org/10.18653/v1/2020.emnlp-main.363)\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 4483–4499, Online\. Association for Computational Linguistics\.
- Lin et al\. \(2018\)Tsung\-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár\. 2018\.[Focal loss for dense object detection](https://arxiv.org/abs/1708.02002)\.*Preprint*, arXiv:1708\.02002\.
- Naseem et al\. \(2026a\)Usman Naseem, Robert Geislinger, Juan Ren, Sarah Kohail, Rudy Garrido Veliz, P Sam Sahil, Yiran Zhang, Marco Antonio Stranisci, Idris Abdulmumin, Özge Alacam, Cengiz Acartürk, Aisha Jabr, Saba Anwar, Abinew Ali Ayele, Elena Tutubalina, Aung Kyaw Htet, Xintong Wang, Surendrabikram Thapa, Tanmoy Chakraborty, Dheeraj Kodati, Sahar Moradizeyveh, Firoj Alam, Ye Kyaw Thu, Shantipriya Parida, Ihsan Ayyub Qazi, Nelson Odhiambo Onyango, Clemencia Siro, Ibrahim Said Ahmad, Lilian Wanzare, Adem Chanie Ali, Martin Semmann, Chris Biemann, Shamsuddeen Hassan Muhammad, and Seid Muhie Yimam\. 2026a\.SemEval\-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization\.In*Proceedings of the 20th International Workshop on Semantic Evaluation \(SemEval\-2026\)*, San Diego, CA, USA\. Association for Computational Linguistics\.
- Naseem et al\. \(2026b\)Usman Naseem, Robert Geislinger, Juan Ren, Sarah Kohail, Rudy Garrido Veliz, P Sam Sahil, Yiran Zhang, Marco Antonio Stranisci, Idris Abdulmumin, Özge Alacam, Cengiz Acartürk, Aisha Jabr, Saba Anwar, Abinew Ali Ayele, Simona Frenda, Alessandra Teresa Cignarella, Elena Tutubalina, Oleg Rogov, Aung Kyaw Htet, Xintong Wang, Surendrabikram Thapa, Kritesh Rauniyar, Tanmoy Chakraborty, Arfeen Zeeshan, Dheeraj Kodati, Satya Keerthi, Sahar Moradizeyveh, Firoj Alam, Arid Hasan, Syed Ishtiaque Ahmed, Ye Kyaw Thu, Shantipriya Parida, Ihsan Ayyub Qazi, Lilian Wanzare, Nelson Odhiambo Onyango, Clemencia Siro, Jane Wanjiru Kimani, Ibrahim Said Ahmad, Adem Chanie Ali, Martin Semmann, Chris Biemann, Shamsuddeen Hassan Muhammad, and Seid Muhie Yimam\. 2026b\.[Polar: A benchmark for multilingual, multicultural, and multi\-event online polarization](https://arxiv.org/abs/2505.20624)\.*Preprint*, arXiv:2505\.20624\.
- Pires et al\. \(2019\)Telmo Pires, Eva Schlinger, and Dan Garrette\. 2019\.[How multilingual is multilingual BERT?](https://doi.org/10.18653/v1/P19-1493)In*Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*, pages 4996–5001, Florence, Italy\. Association for Computational Linguistics\.
- Read et al\. \(2009\)Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank\. 2009\.[Classifier chains for multi\-label classification](https://doi.org/10.1007/978-3-642-04174-7_17)\.In*Machine Learning and Knowledge Discovery in Databases*, pages 254–269, Berlin, Heidelberg\. Springer Berlin Heidelberg\.
- Röttger et al\. \(2021\)Paul Röttger, Bertie Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, and Janet Pierrehumbert\. 2021\.[HateCheck: Functional tests for hate speech detection models](https://aclanthology.org/2021.acl-long.4)\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 41–58\. Association for Computational Linguistics\.
- Rust et al\. \(2021\)Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych\. 2021\.[How good is your tokenizer? On the monolingual performance of multilingual language models](https://doi.org/10.18653/v1/2021.acl-long.243)\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 3118–3135, Online\. Association for Computational Linguistics\.
- Sechidis et al\. \(2011\)Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis P\. Vlahavas\. 2011\.[On the stratification of multi\-label data](https://doi.org/10.1007/978-3-642-23808-6_10)\.In*ECML/PKDD*\.
- Vaswani et al\. \(2017\)Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin\. 2017\.[Attention is all you need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)\.In*Advances in Neural Information Processing Systems*, volume 30\. Curran Associates, Inc\.
- Waseem and Hovy \(2016\)Zeerak Waseem and Dirk Hovy\. 2016\.[Hateful symbols or hateful people? predictive features for hate speech detection on Twitter](https://doi.org/10.18653/v1/N16-2013)\.In*Proceedings of the NAACL Student Research Workshop*, pages 88–93, San Diego, California\. Association for Computational Linguistics\.
- Wu and Dredze \(2019\)Shijie Wu and Mark Dredze\. 2019\.[Beto, Bentz, Becas: The surprising cross\-lingual effectiveness of BERT](https://doi.org/10.18653/v1/D19-1077)\.In*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\)*, pages 833–844, Hong Kong, China\. Association for Computational Linguistics\.
- Yu et al\. \(2020\)Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn\. 2020\.[Gradient surgery for multi\-task learning](https://proceedings.neurips.cc/paper_files/paper/2020/file/3fe78a8acf5fda99de95303940a2420c-Paper.pdf)\.In*Advances in Neural Information Processing Systems*, volume 33, pages 5824–5836\. Curran Associates, Inc\.

## Appendix AInput/Output Schemas

A subset of the training data for each subtask has been selected as illustrative examples\. The corresponding input and output schemas for the three subtasks are presented below\.

### A\.1 Subtask 1

This subtask is a binary classification task that determines whether a given text exhibits polarized content\. A text is labeled as polarized \(Polarization = 1\) only if it explicitly expresses an opinion indicative of attitude polarization, accounting for the overall context and semantic meaning\. Texts lacking such characteristics are labeled as non\-polarized \(Polarization = 0\)\.

Table 3: Subtask 1 input examples\.

Table 4: Subtask 1 output examples\.

### A\.2 Subtask 2

This multi\-label classification task identifies the specific targets of polarization in each text\. Each text is evaluated across five predefined categories: political, racial/ethnic, religious, gender/sexual, and other\. A label of 1 denotes the presence of a given polarization type, while 0 indicates its absence\. Multiple labels may be assigned to a single text if multiple types are present\.

Table 5: Subtask 2 input examples\.

Table 6: Subtask 2 output examples\.

### A\.3 Subtask 3

This subtask identifies the specific manifestations of polarization in each text\. Each text is evaluated across six predefined categories: Stereotype, Vilification, Dehumanization, Extreme\_Language, Lack\_of\_Empathy, and Invalidation\. A label of 1 denotes presence, while 0 indicates absence\. Multiple labels may be assigned to a single text if multiple manifestations are present\.

For brevity, texts and category names are abbreviated in the input table\.

Table 7: Subtask 3 input examples\. Abbreviations: Stereo\. = Stereotype; Vilifi\. = Vilification; Dehuman\. = Dehumanization\.

Table 8: Subtask 3 output examples\.

## Appendix B Data Statistics

Table [9](https://arxiv.org/html/2605.06231#A2.T9) provides the detailed instance counts for each language across the training, development, and test sets\.

Table 9: Detailed dataset statistics for all 22 languages\.

These languages cover a broad range of linguistic and cultural contexts, providing a diverse basis for detecting and analyzing online polarization in multilingual settings\.

## Appendix C Detailed Official Rankings

Table [10](https://arxiv.org/html/2605.06231#A3.T10) provides the official system rankings across all three subtasks for the 22 languages included in the dataset\. Subtask 3 was evaluated on a subset of 18 languages; instances where a language was not included in that subtask are marked with a dash \(—\)\.

Table 10: Official system rankings per language for Subtask 1, Subtask 2, and Subtask 3\.

## Appendix D Implementation Details

### D\.1 Preprocessing

We apply minimal preprocessing across all subtasks\. Rows with missing text are removed, and examples whose cleaned text becomes empty are discarded\. No task\-specific normalization is applied beyond the standardization already present in the shared\-task data: URLs are already normalized as \[URL\], and the released files we use do not contain user placeholders such as @USER\. We retain emojis because they may carry affective or evaluative cues relevant to polarization\.

For tokenization, we use the default tokenizer associated with each backbone\. Texts are encoded with truncation=True, padding=False, and max\_length=256; dynamic padding is applied at batch time\. In practice, most posts are short, so truncation is triggered only for a small fraction of examples\.
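
The encoding scheme described above (truncate to 256 tokens, no padding at encode time, pad per batch) can be illustrated with a minimal library-free sketch. The helper names `truncate` and `pad_batch`, the pad id of 0, and the toy token ids are assumptions made for illustration and are not taken from the authors' code; real tokenizers also account for special tokens when truncating. In Hugging Face Transformers, batch-time padding of this kind is typically handled by `DataCollatorWithPadding`.

```python
from typing import List, Tuple

MAX_LENGTH = 256  # max_length reported in the paper


def truncate(token_ids: List[int], max_length: int = MAX_LENGTH) -> List[int]:
    """Keep at most max_length token ids; no padding is added here."""
    return token_ids[:max_length]


def pad_batch(
    batch: List[List[int]], pad_id: int = 0  # pad_id is a placeholder value
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad each sequence only up to the longest sequence in this batch."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Hypothetical token ids standing in for tokenizer output of three posts.
encoded = [truncate(list(range(5, 5 + n))) for n in (3, 7, 300)]

# A batch of two short posts is padded to length 7, not to 256.
ids, mask = pad_batch(encoded[:2])
```

Because most posts are short, padding to the batch maximum rather than to 256 avoids spending compute on padding tokens.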

### D\.2 Hyperparameter Search and Training

All models are implemented in PyTorch using the Hugging Face Transformers library\. We tune the learning rate over $\{1\times 10^{-5},\ 2\times 10^{-5}\}$ and use AdamW with a linear scheduler and a warmup ratio of 0\.1\. Weight decay is set to 0, early stopping uses a patience of 2, and gradient clipping is not applied\.
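
To make the schedule concrete, the following sketch computes the learning-rate multiplier of a linear schedule with warmup: a linear ramp from 0 to 1 over the warmup steps, then a linear decay to 0. The function name `linear_warmup_factor` and the total of 1000 steps are illustrative assumptions; the formula follows the usual definition used by linear warmup schedulers and is not claimed to be the authors' exact training loop.

```python
def linear_warmup_factor(step: int, total_steps: int, warmup_ratio: float = 0.1) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp-up from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-5        # one of the two learning rates in the search grid
total_steps = 1000    # hypothetical number of optimizer steps
schedule = {s: base_lr * linear_warmup_factor(s, total_steps) for s in (0, 50, 100, 550, 1000)}
```

With a warmup ratio of 0.1, the peak learning rate is reached after the first 10% of steps.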

XLM\-R is trained for 4 epochs, and mDeBERTa for 5 epochs\. We use per\-device batch sizes of 32 and 64, respectively, with gradient accumulation only when needed due to memory constraints\. We also test three random seeds \(42, 2025, 3072\), but do not use seed ensembling because trends are consistent across runs\.

For multi\-label settings, weighted binary cross\-entropy is implemented through BCEWithLogitsLoss with a label\-dependent pos\_weight computed from class frequencies in the corresponding training split\. For the final run, pos\_weight is recomputed on the merged train\+dev data\. Per\-label threshold tuning was explored on development data but was not adopted because it reduced overall Macro\-F1 under extreme label sparsity\.
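
The text does not give the exact formula used for `pos_weight`. A common convention, assumed here, is the ratio of negative to positive examples per label. The sketch below computes that ratio on a toy label matrix and evaluates the per-element loss that `BCEWithLogitsLoss` applies when `pos_weight` is set, written in plain Python so the arithmetic is visible. The helper names and the toy labels are hypothetical.

```python
import math
from typing import List


def compute_pos_weight(labels: List[List[int]]) -> List[float]:
    """Per-label weight = #negatives / #positives (assumed convention)."""
    n = len(labels)
    weights = []
    for j in range(len(labels[0])):
        pos = sum(row[j] for row in labels)
        weights.append((n - pos) / max(pos, 1))  # guard against labels with no positives
    return weights


def weighted_bce_with_logits(logit: float, target: int, pos_weight: float) -> float:
    """Per-element loss: -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    log_sig = -math.log1p(math.exp(-logit))           # log(sigmoid(x))
    log_one_minus_sig = -math.log1p(math.exp(logit))  # log(1 - sigmoid(x))
    return -(pos_weight * target * log_sig + (1 - target) * log_one_minus_sig)


# Toy training labels for two sparse categories (8 examples).
train_labels = [[1, 0], [0, 0], [0, 0], [0, 1], [1, 0], [0, 0], [0, 0], [0, 0]]
pos_weight = compute_pos_weight(train_labels)

# A missed positive on the rarer label costs more than a missed negative.
loss_pos = weighted_bce_with_logits(0.0, 1, pos_weight[1])
loss_neg = weighted_bce_with_logits(0.0, 0, pos_weight[1])
```

Under this convention, rarer labels receive larger weights, so errors on positive examples of sparse categories contribute more to the loss.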

### D\.3 Final Configuration Summary

After development\-stage tuning, we use the following final configuration for the submitted system:

- XLM\-RoBERTa\-large: batch size 32, 4 epochs, learning rate $1\times 10^{-5}$\.
- mDeBERTa\-v3\-base: batch size 64, 5 epochs, learning rate $2\times 10^{-5}$\.
- Thresholding: global threshold $\tau = 0.5$ for Subtask 1 and all labels in Subtasks 2 and 3\.
- Ensembling: weighted probability averaging with $\alpha = 0.7$\.
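
The ensembling and thresholding steps in this configuration can be sketched as follows. This section does not state which model receives the weight $\alpha$ or whether the comparison with $\tau$ is inclusive; the sketch assumes that $\alpha$ weights the XLM-RoBERTa-large probabilities and that a label is predicted when the blended probability is at least $\tau$. The function name and the example probabilities are hypothetical.

```python
from typing import List

ALPHA = 0.7  # ensemble weight from the final configuration
TAU = 0.5    # global decision threshold from the final configuration


def ensemble_predict(
    p_first: List[float], p_second: List[float], alpha: float = ALPHA, tau: float = TAU
) -> List[int]:
    """Blend two models' probabilities, then apply one global threshold."""
    blended = [alpha * a + (1.0 - alpha) * b for a, b in zip(p_first, p_second)]
    return [int(p >= tau) for p in blended]


# Hypothetical per-label probabilities for one text from the two backbones.
p_xlmr = [0.9, 0.4, 0.2]      # assumed to carry weight alpha
p_mdeberta = [0.3, 0.8, 0.1]  # assumed to carry weight 1 - alpha

preds = ensemble_predict(p_xlmr, p_mdeberta)
```

In the second label of this example, the lower-weighted model is confident enough to lift the blended probability above the threshold even though the higher-weighted model alone would predict 0.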

## Appendix E Additional Post\-hoc Visualizations

These figures provide supplementary post\-hoc analyses on the released test gold labels and complement the discussion in Section [5](https://arxiv.org/html/2605.06231#S5)\. They visualize per\-language performance, label\-wise F1, and precision–recall gaps that are summarized only briefly in the main text\.

### E\.1 Per\-language Performance Across Subtasks

Figure [3](https://arxiv.org/html/2605.06231#A5.F3) highlights substantial cross\-lingual variability and the mismatch between strong Subtask 1 performance and weaker fine\-grained performance in some languages\. In particular, several languages remain competitive in binary polarization detection but drop noticeably in Subtasks 2 and 3, which is consistent with the role of label sparsity in the multi\-label settings\.

![Refer to caption](https://arxiv.org/html/2605.06231v1/x3.png) Figure 3: Post\-hoc Macro\-F1 by language and subtask on the released test gold labels\.

### E\.2 Additional Analysis for Subtask 2

Figures [4\(a\)](https://arxiv.org/html/2605.06231#A5.F4.sf1) and [4\(b\)](https://arxiv.org/html/2605.06231#A5.F4.sf2) provide a label\-level view of Subtask 2\. Figure [4\(a\)](https://arxiv.org/html/2605.06231#A5.F4.sf1) shows that some target types are substantially easier to predict than others, consistent with the coverage–learnability pattern discussed in the main text\. Figure [4\(b\)](https://arxiv.org/html/2605.06231#A5.F4.sf2) further shows that several labels exhibit recall\-dominant behavior, which is consistent with over\-prediction under sparse supervision\.

![Refer to caption](https://arxiv.org/html/2605.06231v1/x4.png) \(a\) Average label\-wise F1 for Subtask 2, showing that some target categories are substantially more stable than others\.

![Refer to caption](https://arxiv.org/html/2605.06231v1/x5.png) \(b\) Average precision–recall gap for Subtask 2\. Positive values indicate labels for which recall tends to exceed precision, suggesting over\-prediction under sparsity\.

Figure 4: Supplementary post\-hoc visualizations for Subtask 2\.
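
For readers who want to reproduce the label-level quantities plotted here, the sketch below computes precision, recall, and F1 for one label and then the gap. The sign convention (recall minus precision) is inferred from the caption, which states that positive values correspond to recall exceeding precision; the function name and the toy gold and predicted labels are hypothetical.

```python
from typing import List, Tuple


def label_prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for a single binary label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example of an over-predicted sparse label: 2 true positives, 2 false positives.
gold = [1, 0, 0, 0, 1, 0]
pred = [1, 1, 1, 0, 1, 0]

precision, recall, f1 = label_prf(gold, pred)
gap = recall - precision  # positive gap: recall-dominant (over-prediction)
```

Averaging such per-label F1 values over labels gives a macro-averaged F1; averaging the gaps over languages would give a per-label summary of the kind shown in panel (b), under the assumed sign convention.
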

### E\.3 Additional Analysis for Subtask 3

Figures [5\(a\)](https://arxiv.org/html/2605.06231#A5.F5.sf1) and [5\(b\)](https://arxiv.org/html/2605.06231#A5.F5.sf2) provide a corresponding label\-level view of Subtask 3\. Figure [5\(a\)](https://arxiv.org/html/2605.06231#A5.F5.sf1) illustrates the relative difficulty of different manifestation categories, while Figure [5\(b\)](https://arxiv.org/html/2605.06231#A5.F5.sf2) shows that recall often exceeds precision for the harder labels, supporting the main\-text observation that sparse manifestation labels are especially prone to miscalibration\.

![Refer to caption](https://arxiv.org/html/2605.06231v1/x6.png) \(a\) Average label\-wise F1 for Subtask 3, illustrating the relative difficulty of different manifestation categories\.

![Refer to caption](https://arxiv.org/html/2605.06231v1/x7.png) \(b\) Average precision–recall gap for Subtask 3\. Positive values indicate recall\-dominant behavior, consistent with miscalibration on sparse labels\.

Figure 5: Supplementary post\-hoc visualizations for Subtask 3\.
