

# PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat
Source: [https://arxiv.org/html/2605.07201](https://arxiv.org/html/2605.07201)
###### Abstract

This paper describes our system for the EEUCA 2026 Shared Task on Understanding Toxic Behavior in Gaming Communities. The task involves classifying World of Tanks chat messages into six toxicity categories: Non-toxic, Insults/Flaming, Other Offensive, Hate/Harassment, Threats, and Extremism. We explore multiple approaches including encoder-based models, instruction-tuned LLMs with LoRA fine-tuning, hierarchical classification, one-vs-rest strategies, and various ensemble methods. Our best system combines Llama 3.1 8B with carefully calibrated 5% synthetic data augmentation, achieving an F1-macro score of 0.6234 on the test set, placing 4th out of 35 participating teams. We provide extensive analysis of the dataset's annotation patterns and their impact on model generalization, revealing a critical "validation trap" phenomenon where high validation performance correlates with poor test transfer.


Srikar Kashyap Pulipaka, Independent Researcher (srikar.kashyap@gmail.com)

## 1 Introduction

Online gaming communities face significant challenges with toxic behavior, including harassment, hate speech, and threats. The EEUCA 2026 Shared Task on Understanding Toxic Behavior in Gaming Communities (Thapa et al., [2026](https://arxiv.org/html/2605.07201#bib.bib1)) focuses on detecting and classifying toxicity in World of Tanks chat messages, aiming to promote healthier digital spaces through AI-based moderation tools.

The task presents several unique challenges:

- Extreme class imbalance (81% Non-toxic, <1% for rare classes)
- Short, informal text with gaming-specific vocabulary
- Multilingual content requiring cross-lingual understanding
- Subtle distinctions between toxic categories (e.g., skill-based insults vs. identity-based hate)

Our main strategy combines an instruction-tuned LLM (Llama 3.1 8B) with parameter-efficient fine-tuning via LoRA and carefully calibrated synthetic data augmentation. We find that a narrow 5% synthetic data ratio is optimal, with deviations in either direction significantly degrading test performance.

Our key discovery is the "validation trap" phenomenon: models achieving high validation F1 through conservative predictions (matching the validation distribution) perform poorly on test data. This affected our larger models most severely, with 12B models showing 0.66 validation F1 but only 0.52 test F1. Our final system achieves 0.6234 F1-macro, placing 4th overall out of 35 teams.

## 2 Background

### 2.1 Task Description

The EEUCA 2026 toxicity detection task (Thapa et al., [2026](https://arxiv.org/html/2605.07201#bib.bib1)) is part of the 9th Workshop on Event Extraction and Understanding (Hürriyetoğlu et al., [2026](https://arxiv.org/html/2605.07201#bib.bib2)). The task requires classifying gaming chat messages into six categories based on the annotation schema from Bhandari et al. ([2023](https://arxiv.org/html/2605.07201#bib.bib4)):

- Class 0 (Non-toxic): Normal or positive communication
- Class 1 (Insults/Flaming): Personal attacks targeting gaming skill
- Class 2 (Other Offensive): Inappropriate content not directly attacking
- Class 3 (Hate/Harassment): Targeted abuse based on identity
- Class 4 (Threats): Violence or harm threats
- Class 5 (Extremism): Hate ideology and dehumanization
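
For reference, these categories correspond to the integer labels used throughout the paper; a minimal mapping in Python:

```python
# Label schema for the EEUCA 2026 toxicity task (class IDs as used in this paper).
ID2LABEL = {
    0: "Non-toxic",
    1: "Insults/Flaming",
    2: "Other Offensive",
    3: "Hate/Harassment",
    4: "Threats",
    5: "Extremism",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```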

### 2.2 Dataset

The dataset is derived from the GameTox corpus (Naseem et al., [2025](https://arxiv.org/html/2605.07201#bib.bib3)), comprising World of Tanks chat messages. Table [1](https://arxiv.org/html/2605.07201#S2.T1) shows the severe class imbalance, with Non-toxic messages comprising 81% and rare classes (Threats, Extremism) together representing less than 0.2%.

Table 1: Training set class distribution showing severe imbalance.

Our analysis revealed significant data quality patterns: 40.2% of training messages are exact duplicates, and 13.4% have the same text with different labels. Interestingly, training on deduplicated data hurt performance (0.44 vs. 0.60 F1), suggesting duplicates provide beneficial implicit oversampling.
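
These duplicate statistics can be reproduced approximately with a short pandas check; the file and column names below (`train.csv`, `text`, `label`) are placeholders rather than the official ones:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path; columns "text" and "label" assumed

# Share of rows that repeat an earlier (text, label) pair, i.e. exact duplicates.
exact_dup_share = df.duplicated(subset=["text", "label"]).mean()

# Share of rows whose text also appears under a different label.
conflicting_texts = df.groupby("text")["label"].nunique().gt(1)
conflict_share = df["text"].map(conflicting_texts).mean()

print(f"exact duplicates: {exact_dup_share:.1%}, label conflicts: {conflict_share:.1%}")
```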

### 2.3 Related Work

Toxicity detection has been extensively studied using transformer-based models (Devlin et al., [2019](https://arxiv.org/html/2605.07201#bib.bib7); Liu et al., [2019](https://arxiv.org/html/2605.07201#bib.bib8)). Recent work has shown that instruction-tuned LLMs can achieve strong performance on classification tasks (Wei et al., [2022](https://arxiv.org/html/2605.07201#bib.bib10); Thapa et al., [2025](https://arxiv.org/html/2605.07201#bib.bib6)). Parameter-efficient fine-tuning methods like LoRA (Hu et al., [2022](https://arxiv.org/html/2605.07201#bib.bib9)) enable adaptation of large models with limited resources.

Gaming-specific toxicity presents unique challenges due to domain vocabulary and skill-based criticism that may or may not constitute toxicity (Kwak et al., [2015](https://arxiv.org/html/2605.07201#bib.bib11)). Hate speech detection more broadly has been studied with various approaches (Parihar et al., [2021](https://arxiv.org/html/2605.07201#bib.bib5)).

## 3 System Overview

### 3.1 Model Architecture

We experimented with multiple architectures:

- XLM-RoBERTa Large (560M): Full fine-tuning
- Gemma 2B (Gemma Team, [2024](https://arxiv.org/html/2605.07201#bib.bib15)): LoRA + 8-bit quantization
- Gemma 3 12B (Gemma Team, [2024](https://arxiv.org/html/2605.07201#bib.bib15)): LoRA + 4-bit quantization
- Llama 3.1 8B (Meta AI, [2024](https://arxiv.org/html/2605.07201#bib.bib14)): LoRA + 4-bit quantization (best)

Our final system uses Llama 3.1 8B with 4-bit NF4 quantization (Dettmers et al., [2023](https://arxiv.org/html/2605.07201#bib.bib13)) and LoRA adapters (rank=16, alpha=64).
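
A rough sketch of this setup using Hugging Face transformers, peft, and bitsandbytes; the exact training stack is not specified here, so the checkpoint name, target modules, and classification-head formulation are assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization as described in Section 3.1.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # assumed checkpoint name
    num_labels=6,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters with rank=16, alpha=64, dropout=0.0 (target modules are an assumption).
lora_config = LoraConfig(
    r=16,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
```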

### 3.2 Prompt Engineering

Following insights that class definitions help LLMs discriminate between similar categories, we prepend structured definitions to each input:

> Classify gaming chat toxicity: 0=Non-toxic: Normal/positive chat 1=Insults: Personal attacks, slurs 2=Other Offensive: Inappropriate but not direct 3=Hate/Harassment: Targeted abuse 4=Threats: Violence/harm threats 5=Extremism: Hate ideology Message: [input text]

This "short" prompt style achieved the best balance between providing context and avoiding truncation.
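
A minimal helper that prepends these definitions to a message might look as follows (illustrative only; the exact formatting and truncation handling are not shown here):

```python
PROMPT_PREFIX = (
    "Classify gaming chat toxicity: "
    "0=Non-toxic: Normal/positive chat "
    "1=Insults: Personal attacks, slurs "
    "2=Other Offensive: Inappropriate but not direct "
    "3=Hate/Harassment: Targeted abuse "
    "4=Threats: Violence/harm threats "
    "5=Extremism: Hate ideology "
)

def build_prompt(message: str) -> str:
    """Prepend the short class definitions to a chat message."""
    return f"{PROMPT_PREFIX}Message: {message}"
```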

### 3.3 Synthetic Data Augmentation

We generate synthetic training data via LLM-based paraphrase augmentation, focusing on minority classes. We used a paraphrase-only strategy after preliminary direct-generation experiments produced generic messages that did not match the short, slang-heavy style of real World of Tanks chat. Each source message was rewritten with the following template:

> Rewrite this World of Tanks game chat message using different words but keeping the same meaning and toxicity level. Original: [message] Requirements: Keep EXACT same meaning and level of toxicity; use natural gaming language, abbreviations, slang; similar length (3-20 words). Output ONLY the rewritten message.

The synthetic pool contained 10,464 filtered paraphrases, all from minority toxicity classes: 8,348 for Class 2 (Other Offensive), 1,633 for Class 3 (Hate/Harassment), 343 for Class 4 (Threats), and 140 for Class 5 (Extremism). We applied basic cleaning, invalid-label and length filtering, label-leakage regex filtering, and embedding-based deduplication within the synthetic set. Since paraphrases are intentionally close to their source messages, we did not remove paraphrases for high similarity to the original training examples. Synthetic examples were added only to the training partition after splitting real data; validation remained 100% real.
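
These filtering steps could be sketched roughly as follows; the length bounds, leakage patterns, similarity threshold, and embedding model are illustrative assumptions rather than our exact choices:

```python
import re
from sentence_transformers import SentenceTransformer, util

def filter_paraphrases(paraphrases, min_words=3, max_words=20, sim_threshold=0.95):
    """Length filter, label-leakage filter, and embedding-based dedup (illustrative)."""
    # Drop outputs that leak the label or meta-instructions into the text.
    leakage = re.compile(r"(toxicity level|class \d|rewritten message)", re.IGNORECASE)
    kept = [
        p.strip() for p in paraphrases
        if min_words <= len(p.split()) <= max_words and not leakage.search(p)
    ]

    # Greedy near-duplicate removal within the synthetic set.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(kept, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(embeddings, embeddings)
    deduped, seen = [], []
    for i, text in enumerate(kept):
        if all(sims[i, j] < sim_threshold for j in seen):
            deduped.append(text)
            seen.append(i)
    return deduped
```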

For the final 5% setting, we sampled 1,921 synthetic examples from this pool (1,539 Class 2, 282 Class 3, 64 Class 4, 36 Class 5), yielding an actual synthetic share of 4.998% of the training data. The synthetic ratio proved critical:

- 5% synthetic: Optimal, with best test transfer
- 2-3%: Insufficient, poor test transfer
- 7-10%: Overfitting to synthetic patterns
- 15%: Substantial degradation

The narrow optimal range suggests synthetic data helps by making predictions more "aggressive" on minority classes, better matching the test distribution.
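
One way to draw such a sample, holding the synthetic share at a target ratio while roughly preserving the pool's class composition (a sketch with assumed variable names, not our exact procedure):

```python
import random
from collections import defaultdict

def sample_synthetic(pool, real_train_size, target_ratio=0.05, seed=42):
    """Sample synthetic examples so they form `target_ratio` of the final training set.

    `pool` is a list of (text, label) paraphrases; the pool's class proportions
    are roughly preserved in the sample.
    """
    rng = random.Random(seed)
    n_synth = round(target_ratio / (1.0 - target_ratio) * real_train_size)

    by_class = defaultdict(list)
    for text, label in pool:
        by_class[label].append((text, label))

    sample = []
    for label, items in by_class.items():
        quota = round(n_synth * len(items) / len(pool))
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample
```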

## 4 Alternative Approaches

We explored several alternative strategies that ultimately underperformed:

Hierarchical Classification: Two-stage approach (binary toxic/non-toxic, then 5-class among toxic) achieved 0.67 validation F1 but only 0.47 test F1, the largest generalization gap observed.

One-vs-Rest: Six binary classifiers with aggressive oversampling (up to 500x) and focal loss (Lin et al., [2017](https://arxiv.org/html/2605.07201#bib.bib12)). Too conservative at 0.56 validation F1.
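
For reference, a standard multi-class focal loss (Lin et al., 2017) can be written in a few lines of PyTorch; the hyperparameters shown are common defaults, not necessarily those used in our one-vs-rest experiments:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: down-weights well-classified examples by (1 - p_t)^gamma."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:  # optional per-class weights, shape (num_classes,)
        loss = alpha.to(logits.device)[targets] * loss
    return loss.mean()
```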

Transfer Learning: Pre-training on DOTA 2 toxicity data before fine-tuning resulted in the validation trap (0.68 val → 0.55 test).

Ensemble Methods: Probability averaging, voting, and confidence routing generally hurt performance because our best single model dominated all classes.

Post-hoc Calibration: Platt scaling, isotonic regression, and temperature scaling provided no improvement.
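
As an example of these calibration methods, temperature scaling fits a single scalar on held-out logits; a minimal sketch, assuming validation logits and labels are already collected as tensors:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit a single temperature T by minimizing NLL on the validation set."""
    temperature = torch.nn.Parameter(torch.ones(1))
    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / temperature, val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.item()
```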

## 5 Experimental Setup

### 5.1 Training Configuration

- Model: Llama 3.1 8B
- Quantization: 4-bit NF4
- LoRA: rank=16, alpha=64, dropout=0.0
- Learning rate: 5e-5 (cosine schedule)
- Epochs: 4
- Batch size: 4 (gradient accumulation: 4)
- Loss: class-weighted cross-entropy
- Synthetic ratio: 5%
- Max sequence length: 384
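
This configuration translates roughly into a Hugging Face Trainer setup as sketched below; the weighted-loss subclass and exact argument values are illustrative rather than our exact implementation:

```python
import torch
from transformers import Trainer, TrainingArguments

class WeightedLossTrainer(Trainer):
    """Trainer variant using class-weighted cross-entropy (weights assumed precomputed)."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

training_args = TrainingArguments(
    output_dir="llama31-8b-toxicity",   # assumed output path
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=True,
)
```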

### 5.2 Evaluation

The official metric is macro-averaged F1 score across all six classes. We used the provided validation split for development and hyperparameter tuning.
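
Macro-averaged F1 weights every class equally, so the rare classes count as much as Non-toxic; with scikit-learn, for example:

```python
from sklearn.metrics import f1_score

# Toy example: integer class IDs 0-5, one entry per message.
y_true = [0, 0, 1, 2, 3, 5]
y_pred = [0, 1, 1, 2, 0, 5]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(f"macro F1: {macro_f1:.4f}")
```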

## 6 Results

### 6.1 Main Results

Table [2](https://arxiv.org/html/2605.07201#S6.T2) compares our approaches. Llama 3.1 8B with 5% synthetic data achieves the best test performance. The unboosted 5% synthetic model scored 0.6232; a small post-hoc Class 2 boost increased the official submitted score to 0.6234.
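
One simple way to implement such a post-hoc boost is to scale the predicted probability of Class 2 before taking the argmax; the factor below is an arbitrary illustration, as the exact adjustment is not detailed here:

```python
import numpy as np

def boost_class(probs, class_idx=2, factor=1.1):
    """Scale one class's predicted probability before taking the argmax."""
    boosted = np.asarray(probs, dtype=float).copy()
    boosted[:, class_idx] *= factor
    return boosted.argmax(axis=1)
```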

Table 2: System comparison. Best test result in bold.

### 6.2 Synthetic Data Ablation

Table [3](https://arxiv.org/html/2605.07201#S6.T3) shows the critical sensitivity to synthetic ratio.

Table 3: Effect of synthetic data ratio on Llama 8B.

To understand why 5% transferred best, we compared test prediction distributions for the Llama 8B models in Table [4](https://arxiv.org/html/2605.07201#S6.T4). The 5% model reduced Non-toxic predictions and increased predictions for Classes 2 and 3, the confusable minority categories most affected by the train/test annotation shift. Higher synthetic ratios did not preserve this balance in class-level decisions and reduced test F1.

Table 4: Test prediction distribution for Llama 8B synthetic-data variants.

### 6.3 Per-class Performance

Table [5](https://arxiv.org/html/2605.07201#S6.T5) shows per-class test F1 for the final submitted system. Performance correlates roughly with class frequency, with Class 2 (Other Offensive) and Class 3 (Hate/Harassment) being particularly challenging.

Table 5: Per-class F1 for the final submitted system.

## 7 Analysis

### 7.1 The Validation Trap

Our most significant finding is the "validation trap": models achieving high validation F1 through conservative predictions (matching the 81% Non-toxic distribution) performed poorly on test. Evidence includes:

- Gemma 12B: 0.66 val → 0.52 test
- Transfer learning: 0.68 val → 0.55 test
- Two-stage: 0.67 val → 0.47 test

Models predicting more minority classes (2, 3) performed better on test, suggesting different annotation patterns between splits.

### 7.2 Why 5% Synthetic Works

The 5% ratio appears to increase minority class predictions without overwhelming original patterns. The distribution analysis in Table [4](https://arxiv.org/html/2605.07201#S6.T4) supports this interpretation: relative to the no-synthetic Llama 8B model, the 5% model predicts fewer Non-toxic messages and more Class 2/3 messages, which improves test transfer. Higher synthetic ratios did not yield the same class-level accuracy: the 10% model shifted predictions further toward Class 2 but lost roughly 0.038 test F1, suggesting that excessive synthetic data can reinforce artifacts or shift the model away from the test annotation pattern.

### 7.3 Error Analysis

Common error patterns include:

- Confusion between Class 1 (Insults) and Class 2 (Other Offensive)
- Multilingual messages misclassified as Non-toxic
- Gaming slang incorrectly flagged as toxic

## 8 Conclusion

We presented a comprehensive exploration of approaches for gaming toxicity detection. Key findings:

1. Llama 3.1 8B outperformed both smaller and larger models
2. Synthetic data has a narrow sweet spot (5%)
3. Validation metrics can be misleading due to distribution shift
4. Ensembles do not help when one model dominates

Our system achieves 0.6234 F1-macro, placing 4th out of 35 teams. Future work could explore better handling of distribution shift and external gaming-specific data.

## Limitations

Our analysis is limited to this specific dataset. The "validation trap" phenomenon may be dataset-specific and not generalize. Computational constraints limited exploration of larger models and longer training. The synthetic data approach requires access to commercial LLM APIs.

## Ethics Statement

This work involves detecting toxic content in gaming chat. Models could potentially be misused to generate toxic content or for surveillance. We advocate for responsible deployment in content moderation systems with human oversight, transparency about automated decisions, and appeal mechanisms for users.

## References

- A. Bhandari, S. B. Shah, S. Thapa, U. Naseem, and M. Nasim (2023). CrisisHateMM: multimodal analysis of directed and undirected hate speech in text-embedded images from Russia–Ukraine conflict. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1994–2003.
- T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer (2023). QLoRA: efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems, Vol. 36.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186.
- Gemma Team (2024). Gemma: open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022). LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations.
- A. Hürriyetoğlu, S. Thapa, H. Tanev, L. Thapa, and S. Adhikari (2026). Overview of the workshop on event extraction and understanding: challenges and applications. In Proceedings of the 9th Workshop on Event Extraction and Understanding: Challenges and Applications (EEUCA).
- H. Kwak, J. Blackburn, and S. Han (2015). Exploring cyberbullying and other toxic behavior in team competition online games. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 3739–3748.
- T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Meta AI (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- U. Naseem, S. Shiwakoti, S. B. Shah, S. Thapa, and Q. Zhang (2025). GameTox: a comprehensive dataset and analysis for enhanced toxicity detection in online gaming communities. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 440–447.
- A. S. Parihar, S. Thapa, and S. Mishra (2021). Hate speech detection using natural language processing: applications and challenges. In 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI), pp. 1302–1308.
- S. Thapa, S. Shiwakoti, S. B. Shah, S. Adhikari, H. Veeramani, M. Nasim, and U. Naseem (2025). Large language models (LLM) in computational social science: prospects, current state, and challenges. Social Network Analysis and Mining 15(1), pp. 1–30.
- S. Thapa, S. Shiwakoti, S. B. Shah, K. Rauniyar, L. Thapa, S. Adhikari, K. T. Johnson, A. Hürriyetoğlu, H. Tanev, and U. Naseem (2026). Understanding toxic behavior in gaming communities using AI to promote healthier digital spaces. In Proceedings of the 9th Workshop on Event Extraction and Understanding: Challenges and Applications (EEUCA).
- J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le (2022). Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

## Appendix A Full Test Performance

Table [6](https://arxiv.org/html/2605.07201#A1.T6) reports the full test-set classification report for the final submitted system. These scores were computed after the official test labels were released, using the submitted predictions that achieved 0.6234 macro-F1.

| Class | Precision | Recall | F1 | Support |
| --- | --- | --- | --- | --- |
| 0: Non-toxic | 0.9620 | 0.9242 | 0.9427 | 4351 |
| 1: Insults/Flaming | 0.7563 | 0.7318 | 0.7438 | 742 |
| 2: Other Offensive | 0.3396 | 0.6128 | 0.4370 | 235 |
| 3: Hate/Harassment | 0.4103 | 0.4444 | 0.4267 | 36 |
| 4: Threats | 0.3000 | 0.3750 | 0.3333 | 8 |
| 5: Extremism | 0.7500 | 1.0000 | 0.8571 | 3 |
| Macro average | 0.5864 | 0.6814 | 0.6234 | 5375 |
| Weighted average | 0.9016 | 0.8800 | 0.8887 | 5375 |

Table 6: Full test-set classification report for the final submitted system.
## Appendix B Additional Experimental Results

Table [7](https://arxiv.org/html/2605.07201#A2.T7) summarizes additional systems and ablations explored during development. The pattern reinforces the main paper's validation-trap finding: several systems improved validation F1 but transferred poorly to the test set, while the final Llama 8B system with a small amount of synthetic data gave the best test performance.

Table 7: Additional systems and ablations evaluated during development.
