Multi-Stage Training for Abusive Comment Detection in Indic Languages

arXiv cs.CL 05/22/26, 04:00 AM Papers

Summary

This paper proposes a multi-stage training pipeline using language-based preprocessing and an ensemble of models to detect abusive comments in Indic languages, aiming to minimize false positives while preserving freedom of expression.

arXiv:2605.22380v1 Announce Type: new Abstract: In recent years, social media has become an increasingly popular tool for communication. People use it to share their ideas, exchange information, and discuss thoughts. Given its prevalence and widespread reach, social media must remain a safe space for people. Content generated on social media can be abusive, and it has become increasingly important to detect such content. In this paper, we use language-based preprocessing and an ensemble of several models and analyze their performance on abusive comment detection. Through extensive experimentation, we propose a pipeline that minimizes the false-positive rate (marking non-abusive comments as abusive) so that these systems can detect abusive comments without undermining freedom of expression.

Cached at: 05/22/26, 08:46 AM

# Multi-Stage Training for Abusive Comment Detection in Indic Languages
Source: [https://arxiv.org/abs/2605.22380](https://arxiv.org/abs/2605.22380)
[View PDF](https://arxiv.org/pdf/2605.22380)

> Abstract: In recent years, social media has become an increasingly popular tool for communication. People use it to share their ideas, exchange information, and discuss thoughts. Given its prevalence and widespread reach, social media must remain a safe space for people. Content generated on social media can be abusive, and it has become increasingly important to detect such content. In this paper, we use language-based preprocessing and an ensemble of several models and analyze their performance on abusive comment detection. Through extensive experimentation, we propose a pipeline that minimizes the false-positive rate (marking non-abusive comments as abusive) so that these systems can detect abusive comments without undermining freedom of expression.
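
The abstract does not say how the ensemble is combined or how the false-positive rate is controlled, so the following is only a minimal sketch under stated assumptions, not the authors' pipeline. It assumes each model outputs a probability that a comment is abusive, that the ensemble averages those probabilities (soft voting), and that the decision threshold is chosen on labeled validation data to keep the false-positive rate under a target. All function names, scores, and the threshold grid are hypothetical.

```python
from typing import Sequence


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """FPR = FP / (FP + TN), with label 1 = abusive and 0 = non-abusive."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0


def soft_vote(model_probs: Sequence[Sequence[float]]) -> list[float]:
    """Average the per-comment abusive probabilities across models (assumed scheme)."""
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]


def predict(probs: Sequence[float], threshold: float) -> list[int]:
    """Flag a comment as abusive when its averaged score reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def pick_threshold(y_true: Sequence[int], probs: Sequence[float], max_fpr: float) -> float:
    """Return the lowest threshold on a 0.01 grid whose FPR stays within max_fpr."""
    for i in range(1, 100):
        threshold = i / 100
        if false_positive_rate(y_true, predict(probs, threshold)) <= max_fpr:
            return threshold
    return 1.0  # fall back to flagging (almost) nothing


# Hypothetical validation labels and scores from two hypothetical models.
y_true = [0, 0, 0, 1, 1, 0]
model_a = [0.25, 0.75, 0.25, 1.0, 0.75, 0.0]
model_b = [0.25, 0.5, 0.5, 0.75, 1.0, 0.25]

ensemble = soft_vote([model_a, model_b])
print(false_positive_rate(y_true, predict(ensemble, 0.5)))  # 0.25
print(pick_threshold(y_true, ensemble, max_fpr=0.0))        # 0.63
```

In this toy data, raising the threshold from 0.5 to 0.63 removes the one false positive while both abusive comments are still flagged. On real data a stricter threshold usually costs some recall, which is the trade-off the abstract alludes to.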

## Submission history

From: Pranshu Rastogi [[view email](https://arxiv.org/show-email/a923246b/2605.22380)] **[v1]** Thu, 21 May 2026 12:09:53 UTC (486 KB)

Similar Articles

Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation

MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

Been stuck on a unique NLP problem [D]

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

Measuring, Localizing, and Ablating Alignment Signatures in LLMs
