Multi-Stage Training for Abusive Comment Detection in Indic Languages

arXiv cs.CL Papers

Summary

This paper proposes a multi-stage training pipeline using language-based preprocessing and an ensemble of models to detect abusive comments in Indic languages, aiming to minimize false positives while preserving freedom of expression.

arXiv:2605.22380v1

Abstract: In recent years social media has become an increasingly popular tool for communication. People use it to share their ideas, exchange information, and discuss thoughts. Given its prevalence and widespread reach, social media must remain a safe space for people. Content generated on social media can be abusive and it has become increasingly important to detect such content. In this paper, we use a language-based preprocessing and an ensemble of several models and analyze their performance of abusive comment detection. Through extensive experimentation, we propose a pipeline that minimizes the false-positive rate (marking non-abusive as abusive) so that these systems can detect abusive comments without undermining the freedom of expression.
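
The abstract does not say how the ensemble members are combined or how the false-positive rate is controlled. As a rough illustration only, the sketch below assumes soft voting (averaging each model's abusive-class probability) and picks the lowest decision threshold whose false-positive rate on a labelled validation set stays under a chosen cap. The function names `ensemble_scores`, `false_positive_rate`, and `pick_threshold`, and the toy probabilities, are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def ensemble_scores(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-model abusive-class probabilities (soft voting; an assumption)."""
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]


def false_positive_rate(scores: Sequence[float], labels: Sequence[int],
                        threshold: float) -> float:
    """FPR = FP / (FP + TN), where label 1 = abusive and 0 = non-abusive."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        return 0.0
    false_positives = sum(1 for s in negatives if s >= threshold)
    return false_positives / len(negatives)


def pick_threshold(scores: Sequence[float], labels: Sequence[int],
                   max_fpr: float) -> float:
    """Return the lowest observed score usable as a threshold within max_fpr.

    FPR can only fall as the threshold rises, so the first candidate that
    satisfies the cap keeps as many abusive comments flagged as possible.
    Returns infinity (flag nothing) if no observed score satisfies the cap.
    """
    for candidate in sorted(set(scores)):
        if false_positive_rate(scores, labels, candidate) <= max_fpr:
            return candidate
    return float("inf")


# Toy validation data: three hypothetical models scoring five comments.
validation_probs = [
    [0.9, 0.2, 0.6, 0.1, 0.8],
    [0.7, 0.4, 0.5, 0.3, 0.9],
    [0.8, 0.3, 0.7, 0.2, 0.7],
]
validation_labels = [1, 0, 0, 0, 1]

combined = ensemble_scores(validation_probs)
threshold = pick_threshold(combined, validation_labels, max_fpr=0.0)
flagged = [score >= threshold for score in combined]
```

Choosing the threshold on held-out data rather than fixing it at 0.5 is one common way to trade recall for fewer wrongly flagged comments; the paper's actual pipeline may differ.
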

Source: [https://arxiv.org/abs/2605.22380](https://arxiv.org/abs/2605.22380)
[View PDF](https://arxiv.org/pdf/2605.22380)


## Submission history

From: Pranshu Rastogi

**[v1]** Thu, 21 May 2026 12:09:53 UTC (486 KB)
