Beyond Clean Text: Evaluating Encoder and Decoder Robustness for Bangla Event Detection in Noisy Text
Summary
This paper introduces a Bangla event detection benchmark with noisy text (ASR, orthographic corruption) and evaluates encoder-only and decoder-only LLMs, finding decoder models more robust to noise.
# Beyond Clean Text: Evaluating Encoder and Decoder Robustness for Bangla Event Detection in Noisy Text
Source: [https://arxiv.org/html/2606.30914](https://arxiv.org/html/2606.30914)
Tanvir Ahmed Sijan1, †S\. M Golam Rifat2, Nayeemul Islam3, Md\. Musfique Anwar1

1Jahangirnagar University, Dhaka, Bangladesh; 2Rajshahi University of Engineering & Technology, Rajshahi, Bangladesh; 3Bangladesh University of Engineering and Technology, Dhaka, Bangladesh

\{sijantanv, golamrifat, nayeemulislam\.eee\.buet\}@gmail\.com, manwar@juniv\.edu

†Corresponding author
###### Abstract
Event detection \(ED\) systems are typically evaluated on clean, curated text, leaving their robustness to real\-world noise largely unexplored, particularly for low\-resource languages such as Bangla\. We introduce a generalized Bangla news event ontology and a benchmark comprising 9,979 annotated sentences across 40 event subtypes, spanning clean news text, real\-world Automatic Speech Recognition \(ASR\) transcripts, and orthographically corrupted text\. We systematically evaluate fine\-tuned encoder\-only models \(BanglaBERT and XLM\-R\) alongside instruction\-tuned decoder\-only large language models \(Llama 3 and Gemma 3\)\. Our results reveal a clear architectural trade\-off: encoder models achieve higher performance on clean text but degrade substantially under noise, whereas decoder\-only LLMs are markedly more robust, particularly when event triggers are corrupted\. We further show that embedding annotation guidelines during instruction tuning establishes a higher performance baseline on noisy text but yields inconsistent reductions in performance degradation across noisy conditions\. Finally, model scaling consistently improves the robustness of decoder\-only LLMs, while combined training on clean and noisy data serves as an effective regularization strategy that disproportionately benefits encoder architectures, significantly narrowing the robustness gap\.
## 1 Introduction
The task of event detection \(ED\) identifies and categorizes events in natural language \(Ahn, [2006](https://arxiv.org/html/2606.30914#bib.bib1)\)\. It serves as a foundational component in information retrieval\. Extracting structured event frames is critical for downstream applications, particularly in emergency monitoring systems where rapid and accurate information retrieval is paramount\. Despite its importance, most event detection research focuses almost exclusively on clean and carefully curated text \(Wang et al\., [2020](https://arxiv.org/html/2606.30914#bib.bib38); Pouran Ben Veyseh et al\., [2022](https://arxiv.org/html/2606.30914#bib.bib27); Yao et al\., [2022](https://arxiv.org/html/2606.30914#bib.bib40)\)\. In real\-world applications, data is inherently noisy\. Even minor typographical errors or transcription faults can cause event detection systems to miss vital events entirely\.
Figure 1: Real\-world noise substantially degrades event detection, even when the event trigger itself remains unchanged\. Clean examples are drawn from our Clean Test set, while their counterparts are generated using simulated orthographic noise\. ASR examples are collected from real Bangla news video transcripts and independently annotated using the same event ontology, introducing naturally occurring transcription errors, out\-of\-distribution vocabulary, and transcription artifacts\.

Because Bangla is a low\-resource language, event detection studies for it remain limited\. Existing work has primarily focused on clean text and narrow domains, such as violent incidents \(Khandokar et al\., [2020](https://arxiv.org/html/2606.30914#bib.bib18); Dey et al\., [2021](https://arxiv.org/html/2606.30914#bib.bib7); Ali Khandokar et al\., [2025](https://arxiv.org/html/2606.30914#bib.bib3)\), disasters \(Dave et al\., [2021](https://arxiv.org/html/2606.30914#bib.bib6)\), or crime\-related events \(Hossain et al\., [2025](https://arxiv.org/html/2606.30914#bib.bib13)\)\. Furthermore, there is currently no generalized news event ontology for trigger\-based event detection comparable to widely used schemas such as ACE 2005 \(Walker, Christopher et al\., [2006](https://arxiv.org/html/2606.30914#bib.bib37)\), which limits the systematic study of the task\.
Research on noisy Bangla text is likewise scarce, and the few existing studies have predominantly focused on sentiment analysis \(Islam et al\., [2021](https://arxiv.org/html/2606.30914#bib.bib17); Elahi et al\., [2024](https://arxiv.org/html/2606.30914#bib.bib10)\)\. Sentiment analysis and event detection pose fundamentally different challenges\. While sentiment analysis typically operates at the sequence or document level, event detection requires fine\-grained token\-level identification and classification of event triggers\.
Historically, event extraction has been dominated by BERT\-based encoder models\(Wang et al\.,[2020](https://arxiv.org/html/2606.30914#bib.bib38); Pouran Ben Veyseh et al\.,[2022](https://arxiv.org/html/2606.30914#bib.bib27); Huang et al\.,[2024](https://arxiv.org/html/2606.30914#bib.bib15)\)\. This stands in contrast to the decoder\-only models that dominate the current LLM landscape, whose application to structured extraction tasks has traditionally been limited by hallucination and difficulties in generating outputs that conform to predefined structures\. However, recent studies have attempted to improve the performance of LLMs on event extraction through techniques such as code\-like representations\(Wang et al\.,[2023](https://arxiv.org/html/2606.30914#bib.bib39)\), incorporation of sentence\-level contextual information\(Al Monsur et al\.,[2026](https://arxiv.org/html/2606.30914#bib.bib2)\), and instruction tuning with annotation guidelines to enhance cross\-schema generalization\(Srivastava et al\.,[2025](https://arxiv.org/html/2606.30914#bib.bib34)\), while generative formulations have also been explored for complex event argument extraction\(Sharif et al\.,[2024](https://arxiv.org/html/2606.30914#bib.bib31)\)\.
Given these recent methodological advances for adapting LLMs to structured prediction tasks, an important question emerges regarding how these models compare with their BERT\-based counterparts under real\-world noisy conditions\. This question is particularly relevant for Bangla, where the number and scale of available pre\-trained encoder models are limited, while recent multilingual LLMs have demonstrated increasingly strong capabilities in the language\.
To address these questions, we annotate 5,320 sentences collected from Bangla newspapers and 4,659 sentences of automatic speech recognition \(ASR\) transcripts obtained from Bangla news videos within the same domain\. We evaluate both encoder\-based and decoder\-only models for event detection\. For decoder\-only models, we formulate event detection as a structured code generation task following Wang et al\. \([2023](https://arxiv.org/html/2606.30914#bib.bib39)\), optionally with annotation guidelines embedded in the prompt \(Sainz et al\., [2024](https://arxiv.org/html/2606.30914#bib.bib29); Srivastava et al\., [2025](https://arxiv.org/html/2606.30914#bib.bib34)\)\. In addition, to simulate orthographic noise, we employ the error generation algorithm of Sifat et al\. \([2020](https://arxiv.org/html/2606.30914#bib.bib32)\), which was developed by analyzing common Bengali writing patterns and typographical behaviors, and evaluate model robustness under varying degrees of noise\. In summary, our contributions are as follows:
- We develop a generalized news\-domain event schema for Bangla event detection and release a dataset comprising 9,979 annotated sentences spanning both clean news text and noisy ASR transcripts, containing 7,813 event mentions across 40 event subtypes\.
- We provide a systematic comparison of encoder\-based and decoder\-only architectures with varying parameter sizes and degrees of Bangla support under multiple noisy conditions\. For decoder\-only models, we adopt recent recommendations for structured output generation through instruction tuning and code\-based representations\.
- We present the first comprehensive study of robustness for Bangla event detection under both real\-world ASR\-induced and simulated orthographic noise, providing insights into the relative strengths and limitations of modern multilingual LLMs and traditional encoder\-based approaches\.
Figure 2:Overview of the dataset construction, training, and evaluation pipeline\. We develop a generalized Bangla news event ontology, collect topic\-balanced news and ASR corpora, and annotate both using the same event schema\. Encoder\-only and decoder\-only models are trained under two settings: Clean and Combined \(Clean \+ ASR\) Training Set\. Decoder\-only LLMs are instruction\-tuned using code\-format prompts with and without annotation guidelines embedded as Python docstrings\. Robustness is evaluated on the Clean, real\-world ASR transcription noise, and simulated orthographic noise test sets\.
## 2 Related Work
Early event detection systems relied on feature engineering and statistical learning techniques \(Ahn, [2006](https://arxiv.org/html/2606.30914#bib.bib1); Patwardhan and Riloff, [2009](https://arxiv.org/html/2606.30914#bib.bib26); Hong et al\., [2011](https://arxiv.org/html/2606.30914#bib.bib12); Li et al\., [2013](https://arxiv.org/html/2606.30914#bib.bib22)\)\. More recently, pretrained Transformer encoders such as BERT have become the dominant paradigm, substantially advancing trigger detection performance \(Nguyen et al\., [2021](https://arxiv.org/html/2606.30914#bib.bib24); Wang et al\., [2020](https://arxiv.org/html/2606.30914#bib.bib38); Pouran Ben Veyseh et al\., [2022](https://arxiv.org/html/2606.30914#bib.bib27); Huang et al\., [2024](https://arxiv.org/html/2606.30914#bib.bib15)\)\. Alongside these architectural advances, recent work has also focused on developing event detection datasets spanning diverse domains \(Kim et al\., [2009](https://arxiv.org/html/2606.30914#bib.bib19); Sims et al\., [2019](https://arxiv.org/html/2606.30914#bib.bib33); Le and Nguyen, [2021](https://arxiv.org/html/2606.30914#bib.bib20); Yao et al\., [2022](https://arxiv.org/html/2606.30914#bib.bib40)\) and languages \(Pouran Ben Veyseh et al\., [2022](https://arxiv.org/html/2606.30914#bib.bib27); Touileb et al\., [2024](https://arxiv.org/html/2606.30914#bib.bib36)\)\. Large\-scale resources such as MAVEN \(Wang et al\., [2020](https://arxiv.org/html/2606.30914#bib.bib38)\), RAMS \(Ebner et al\., [2020](https://arxiv.org/html/2606.30914#bib.bib9)\), and the TextEE benchmark \(Huang et al\., [2024](https://arxiv.org/html/2606.30914#bib.bib15)\) have become standard evaluation platforms for modern event detection research\.
Compared with the extensive literature on English event detection, Bangla has received considerably less attention\. Existing studies have primarily developed task\-specific datasets and models targeting individual application domains, including violent incidents\(Khandokar et al\.,[2020](https://arxiv.org/html/2606.30914#bib.bib18); Dey et al\.,[2021](https://arxiv.org/html/2606.30914#bib.bib7); Ali Khandokar et al\.,[2025](https://arxiv.org/html/2606.30914#bib.bib3)\), disasters\(Dave et al\.,[2021](https://arxiv.org/html/2606.30914#bib.bib6)\), and crime\-related news\(Hossain et al\.,[2025](https://arxiv.org/html/2606.30914#bib.bib13)\)\. Moreover, publicly available resources do not provide a generalized news\-domain ontology comparable to ACE 2005\(Walker, Christopher et al\.,[2006](https://arxiv.org/html/2606.30914#bib.bib37)\), making standardized evaluation across diverse event categories difficult\. Studies on noisy Bangla text are likewise limited and have focused primarily on sentence\-level tasks such as sentiment analysis\(Islam et al\.,[2021](https://arxiv.org/html/2606.30914#bib.bib17); Elahi et al\.,[2024](https://arxiv.org/html/2606.30914#bib.bib10)\)rather than token\-level trigger identification\.
The rapid progress of decoder\-only LLMs has recently motivated their application to event extraction\. Recent studies have explored adapting LLMs to this task through structured code representations \(Wang et al\., [2023](https://arxiv.org/html/2606.30914#bib.bib39)\), context\-aware encoding \(Al Monsur et al\., [2026](https://arxiv.org/html/2606.30914#bib.bib2)\), and instruction tuning with annotation guidelines to improve schema understanding and cross\-schema generalization \(Srivastava et al\., [2025](https://arxiv.org/html/2606.30914#bib.bib34)\)\. Beyond event extraction, annotation guidelines have also been shown to improve other information extraction tasks, including Named Entity Recognition \(Sainz et al\., [2024](https://arxiv.org/html/2606.30914#bib.bib29)\) and relation extraction \(Pang et al\., [2023](https://arxiv.org/html/2606.30914#bib.bib25)\)\. However, these methods have been evaluated almost exclusively on clean English benchmarks\. The robustness of encoder\-only and decoder\-only models under realistic noisy conditions, as well as the impact of annotation guidelines during instruction tuning, therefore remains largely unexplored\.
In this work, we address these gaps using Bangla as a representative low\-resource language\. We introduce a generalized Bangla news event ontology, construct a benchmark comprising clean news articles, real\-world ASR transcripts, and simulated orthographic noise, and systematically compare fine\-tuned encoder\-only and instruction\-tuned decoder\-only models\. We further investigate whether annotation guidelines improve model performance and robustness under noisy conditions\.
## 3 Benchmark Development
### 3\.1 Task Formulation
We evaluate event detection across two distinct modeling paradigms: Token Classification for encoder\-only architectures and Structured Sequence Generation for decoder\-only large language models\.
Formally, let an input text sequence of length $n$ be defined as $X = [x_1, x_2, \dots, x_n]$\. Our ontology consists of a predefined set of event types $\mathcal{E}$\. The goal of the event detection system is to extract a set of event instances $Y$ from $X$\. Each extracted event instance is defined as a tuple $(i, j, t)$, where $[x_i, \dots, x_j]$ represents the contiguous text span that triggers the event, and $t \in \mathcal{E}$ is the predicted event type\. To systematically evaluate the capabilities of different architectures, the extraction of $Y$ is formulated through two paradigms:
##### Event Detection as Token Classification\.
Under this traditional sequence\-labeling paradigm, the task is framed as a token\-level classification problem\. A model processes the input text $X$ and assigns a label $y_k$ to each token\. The model must simultaneously perform Trigger Identification \(locating the boundaries $i$ and $j$ of the trigger span\) and Trigger Classification \(mapping the identified span to an event type $t \in \mathcal{E}$\)\.
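To make the correspondence between the tuple view $(i, j, t)$ and per\-token labels concrete, the following is a minimal sketch that converts trigger spans to BIO labels and back\. It assumes 0\-based, inclusive indices and a one\-token\-per\-word segmentation; the event type names `Attack` and `Die` are illustrative placeholders, not entries taken from the ontology\.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, event_type), end inclusive, 0-based


def spans_to_bio(n_tokens: int, events: List[Span]) -> List[str]:
    """Encode trigger spans as one BIO label per token."""
    labels = ["O"] * n_tokens
    for start, end, event_type in events:
        labels[start] = f"B-{event_type}"
        for k in range(start + 1, end + 1):
            labels[k] = f"I-{event_type}"
    return labels


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Decode BIO labels back into (start, end, event_type) spans."""
    spans: List[Span] = []
    start, current = None, None
    for k, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        continues = label.startswith("I-") and current == label[2:]
        if start is not None and not continues:
            spans.append((start, k - 1, current))
            start, current = None, None
        if label.startswith("B-"):
            start, current = k, label[2:]
    return spans


labels = spans_to_bio(5, [(1, 2, "Attack"), (4, 4, "Die")])
print(labels)
print(bio_to_spans(labels))
```

In this sketch, a stray `I-` label without a preceding `B-` is ignored during decoding, which is one common convention among several\.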
##### Event Detection as Structured Sequence Generation\.
For generative LLMs, the task is reframed as conditional sequence generation\. Given the input sentence $X$, a natural language task instruction $I$, and the target event schema $E_t$ for a specific event type $t \in \mathcal{E}$, we construct a unified prompt $P$:

$$P = [I \oplus E_t \oplus X] \tag{1}$$

where $\oplus$ denotes string concatenation\. The instruction $I$ dictates the extraction rules and desired structured output format, while $E_t$ provides the semantic definition of the target event\. The model is trained to autoregressively generate the structured representation of $Y$, explicitly handling both the extraction of valid triggers and the rejection of schemas not present in $X$\.
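As an illustration of Equation \(1\), the sketch below concatenates an instruction, a code\-style event schema, and the input sentence into one prompt string\. The class layout, the `Event` base class, the `trigger` field, and the function name `build_prompt` are all hypothetical; the prompt format actually used is the one described in the appendix, which this sketch does not reproduce\.

```python
def build_prompt(instruction: str, event_type: str, sentence: str,
                 guideline: str = "") -> str:
    """Build P = [I (+) E_t (+) X] as a single string.

    The schema E_t is rendered as a Python-like class. When a guideline is
    given, it is embedded as the class docstring; otherwise the schema carries
    only the event type name.
    """
    docstring = f'    """{guideline}"""\n' if guideline else ""
    schema = (
        f"class {event_type}(Event):\n"
        f"{docstring}"
        f"    trigger: str\n"
    )
    return f"{instruction}\n\n{schema}\ntext = {sentence!r}\n"


# Hypothetical inputs, shown in English for readability.
print(build_prompt(
    "Extract every trigger of the event type defined below, or return an empty list.",
    "Attack",
    "A bomb exploded near the market.",
    guideline="A violent physical act causing harm or damage.",
))
```

Calling the function without `guideline` yields a prompt that exposes only the event type name, so the same helper covers a schema with and without embedded definitions\.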
### 3\.2 Ontology Development
We adopted the widely used ACE 2005 \(Walker, Christopher et al\., [2006](https://arxiv.org/html/2606.30914#bib.bib37)\) event ontology as the foundation of our schema\. However, several event categories frequently appearing in Bangladeshi news, such as natural or manmade disasters, disease outbreaks, festivals, socio\-economic events, and certain crime\-related events, are not explicitly represented in ACE 2005\.
To identify event types relevant to the local news landscape, we analyzed Bangladesh\-specific event distributions from GDELT \(Leetaru and Schrodt, [2013](https://arxiv.org/html/2606.30914#bib.bib21)\) and examined frequently occurring themes in the GKG corpus between 2023 and 2025\. Specifically, we extracted and ranked CAMEO event codes and GKG themes associated with Bangladesh to understand the types of events commonly reported in local news\.
Based on this analysis, we extended the ACE 2005 ontology with locally relevant event types and initially constructed a schema consisting of 16 top\-level event categories and 70 event sub\-types\. Following the annotation process, event types with fewer than 50 instances were removed to improve class coverage and evaluation reliability\. The final ontology comprises 14 main events and 40 sub\-events\. The ontology details are presented in Appendix[A](https://arxiv.org/html/2606.30914#A1)\.
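The frequency\-based pruning step can be sketched as follows\. The function name `prune_rare_types` and the `(trigger, event_type)` representation of a mention are assumptions made for illustration; only the threshold of 50 instances comes from the text\.

```python
from collections import Counter
from typing import List, Set, Tuple

Mention = Tuple[str, str]  # (trigger_text, event_type)


def prune_rare_types(mentions: List[Mention],
                     min_count: int = 50) -> Tuple[List[Mention], Set[str]]:
    """Drop every event type with fewer than min_count annotated mentions."""
    counts = Counter(event_type for _, event_type in mentions)
    kept_types = {t for t, c in counts.items() if c >= min_count}
    kept_mentions = [m for m in mentions if m[1] in kept_types]
    return kept_mentions, kept_types


# Toy data: one frequent type and one rare type.
toy = [(f"trigger{i}", "Attack") for i in range(60)] + [("x", "RareType")] * 3
kept_mentions, kept_types = prune_rare_types(toy)
print(len(kept_mentions), sorted(kept_types))
```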
#### 3\.2\.1 Data Collection
##### News Text
To construct the clean text corpus, we utilized the Bangla News Article Dataset \(BNAD\) \(Saad et al\., [2024](https://arxiv.org/html/2606.30914#bib.bib28)\), which contains category\-wise collections of news articles sourced from Bangla online newspapers\. We collected articles from three major Bangladeshi newspapers: Dhaka Tribune Bangla, The Daily Ittefaq, and Daily Janakantha\. Articles were sampled from categories aligned with the event categories in our ontology to ensure balanced coverage across diverse domains\. The coverage of topics in the dataset is described in Appendix [A](https://arxiv.org/html/2606.30914#A1)\.
##### ASR Transcript
To construct our real\-world noisy ASR corpus, we curated publicly available news broadcasts from the official YouTube channels of three of the most popular Bangladeshi television networks: Jamuna TV, Somoy TV, and Independent Television\. We extracted the automatically generated Bangla speech transcripts associated with these broadcasts\. Utilizing these ASR transcripts provides a naturally occurring source of noisy Bangla text\. This data inherently captures colloquial vocabulary, dialectal verb inflections, homophone substitutions, and transcription errors, accurately reflecting the morphological and lexical challenges encountered by event extraction systems in real\-world spoken media\.
### 3\.3 Dataset Quality Control
##### Annotation Procedure
To ensure annotation quality and consistency, we first developed a comprehensive annotation guideline based on the proposed ontology\. The guideline provided detailed descriptions of fine\-grained event types and subtypes, representative trigger examples, boundary annotation instructions, and resolutions for common edge cases\. Based on these guidelines, we designed a screening test to assess potential annotators\. Ultimately, six annotators who passed the screening were recruited\. The annotation process was conducted using the academic version of Label Studio \(HumanSignal, [2020](https://arxiv.org/html/2606.30914#bib.bib16)\)\.
##### Inter\-Annotator Agreement
Each instance was independently annotated by two annotators\. Inter\-annotator agreement \(IAA\), measured using Cohen’s $\kappa$, was 0\.72, indicating substantial agreement\.
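For reference, Cohen’s $\kappa$ compares observed agreement $p_o$ with the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$\. The sketch below computes it for two annotators over a shared list of items\. The unit of agreement \(token, trigger span, or sentence\) is not specified in the text, so the per\-item labels here are an assumption\.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:                                       # both used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy example with hypothetical labels.
ann1 = ["O", "O", "Attack", "Attack"]
ann2 = ["O", "O", "Attack", "O"]
print(cohens_kappa(ann1, ann2))
```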
##### Adjudication and Quality Control
Annotation disagreements were resolved through a two\-stage adjudication process consisting of peer discussion followed by verification by senior annotators\. Annotated sentences with unresolved ambiguity were excluded from the final dataset\. Overall, approximately 18% of annotated sentences were discarded during the quality control process\.
### 3\.4 Dataset Statistics
In total, the finalized benchmark corpus comprises 9,979 annotated sentences, including 5,320 sentences from news articles and 4,659 sentences from ASR transcripts, containing a total of 7,813 event mentions\. Dataset statistics are summarized in Table 1\.

| Source | \# Annotated Sentences | \# Event Mentions | Avg\. Words/Instance |
| --- | --- | --- | --- |
| News Article | 5320 | 4659 | 14\.1 |
| ASR Transcript | 4484 | 3407 | 11\.5 |
| Total | 9979 | 7813 | 12\.9 |

Table 1: Dataset Statistics

Figure 3: Frequency distribution of the 40 event types in the News and ASR corpora, showing the long\-tail nature of both datasets\. Rankings are computed independently for each corpus\.
### 3\.5 Event Class Distribution
Both the News Text and ASR Transcript subsets exhibit a natural long\-tail distribution over the 40 event subtypes, as shown in Figure[3](https://arxiv.org/html/2606.30914#S3.F3)\. To maintain evaluation reliability and ensure comparable coverage, event types with fewer than 50 instances were discarded during ontology refinement\. As a result, the benchmark preserves the realistic class imbalance inherent in real\-world reporting while providing sufficient data support for all retained classes\.
### 3\.6 Orthographic Noise Test Data
While our ASR corpus captures transcription errors arising from acoustic ambiguity, pronunciation variation, and missing or incorrectly recognized tokens, these differ substantially from the orthographic noise commonly encountered in user\-generated Bangla text\. Orthographic errors in Bangla are predominantly lexical and character\-level phenomena caused by keyboard proximity, phonetic substitutions, spelling mistakes, and the incorrect handling of Juktakkhar \(consonant conjuncts\)\. Therefore, robustness against ASR errors does not inherently guarantee robustness against typographic noise\.
Collecting a large corpus of naturally occurring orthographic errors would require additional annotation effort and label re\-verification\. Instead, we simulate typographic noise using the algorithm proposed by Sifat et al\. \([2020](https://arxiv.org/html/2606.30914#bib.bib32)\), which generates realistic Bangla typing errors based on English QWERTY keyboard patterns, including phonetic substitutions, neighboring key errors, insertion patterns, and Juktakkhar\-specific rules\. This approach also enables controlled evaluation of model robustness under varying noise levels while preserving the original event annotations\.
To systematically evaluate model behavior under increasing levels of orthographic degradation, we apply this noise injection process to our Clean Test set and parameterize it by the Word Corruption Rate \(WCR\), $x \in \{10, 20, 30, 40\}$\. Noise is injected cumulatively, such that each higher WCR level retains all corruptions introduced at lower levels while introducing additional corrupted words until approximately $x\%$ of the words are modified\. The remaining context is left unchanged\.
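The cumulative scheme can be sketched as follows\. This is a minimal illustration that assumes corruption is applied per sentence and that `corrupt_word` is a placeholder for a word\-level error generator; it does not implement the algorithm of Sifat et al\., and the function name `corrupt_cumulatively` is hypothetical\.

```python
import random
from typing import Callable, Dict, List, Sequence


def corrupt_cumulatively(
    words: List[str],
    corrupt_word: Callable[[str], str],
    rates: Sequence[int] = (10, 20, 30, 40),
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Return one corrupted copy of `words` per WCR level, with nested corruptions."""
    rng = random.Random(seed)
    order = list(range(len(words)))
    rng.shuffle(order)  # a single corruption order shared by every level
    corrupted = {i: corrupt_word(words[i]) for i in order}  # one fixed corruption per word
    levels: Dict[int, List[str]] = {}
    for rate in sorted(rates):
        k = round(len(words) * rate / 100)
        chosen = set(order[:k])  # a prefix of the order, so lower levels are subsets
        levels[rate] = [corrupted[i] if i in chosen else w for i, w in enumerate(words)]
    return levels


# Placeholder corruption: append a marker instead of a realistic typing error.
sentence = [f"w{i}" for i in range(10)]
for rate, noisy in corrupt_cumulatively(sentence, lambda w: w + "*").items():
    print(rate, noisy)
```

Because every level takes a prefix of the same shuffled order and reuses the same corrupted form of each word, a word corrupted at 10% stays identically corrupted at 20%, 30%, and 40%, which matches the cumulative property described above\.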
## 4 Experiments
### 4\.1 Experimental Setup
##### Datasets and Evaluation Splits
We evaluate model performance and robustness across three distinct test sets: the Clean Test set, comprising standard Bangla news text; the ASR Test set, consisting of automatically generated transcripts from Bangla YouTube news videos; and the Ortho Test set, generated from the Clean Test set by simulating realistic orthographic errors at varying Word Corruption Rates \(WCRs\)\. Because sentences frequently contain multiple event triggers and event types follow a long\-tail distribution, standard random splitting risks omitting rare classes from the evaluation sets, which would severely compromise the reliability of our evaluation\. To ensure rigorous class\-wise representation, we partition the data using an iterative multi\-label stratification algorithm \(Sechidis et al\., [2011](https://arxiv.org/html/2606.30914#bib.bib30); Szymański and Kajdanowicz, [2017](https://arxiv.org/html/2606.30914#bib.bib35)\)\. Using this method, the Clean corpus was partitioned into a 70/15/15 ratio \(training/validation/testing\), while the ASR corpus was partitioned into a 70/10/20 ratio\. The final sentence\-level distributions for these datasets are detailed in Table [2](https://arxiv.org/html/2606.30914#S4.T2)\.

| Dataset | Train & Val | Test |
| --- | --- | --- |
| News Text | 4504 | 816 |
| ASR Transcripts | 3705 | 954 |
| Combined \(Clean \+ ASR\) | 8209 | – |
| Ortho Test | – | 816† |

Table 2: Raw sentence counts across dataset splits\. †Indicates the count per discrete WCR level\.
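The idea behind iterative multi\-label stratification can be sketched in a few lines\. The code below is a simplified re\-implementation written for illustration, in the spirit of Sechidis et al\.; it is not the implementation used for the splits above, and the function name `iterative_stratify` is hypothetical\. It repeatedly takes the rarest remaining label and sends each of its sentences to the fold that still needs that label most\.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Set


def iterative_stratify(label_sets: Sequence[Set[str]],
                       ratios: Sequence[float],
                       seed: int = 0) -> List[List[int]]:
    """Assign sentence indices to folds so that each label is spread by `ratios`."""
    rng = random.Random(seed)
    n = len(label_sets)
    all_labels = set().union(*label_sets)
    # Desired number of sentences per fold, overall and per label.
    want_total = [r * n for r in ratios]
    want_label = {l: [r * sum(l in s for s in label_sets) for r in ratios]
                  for l in all_labels}
    remaining = set(range(n))
    folds: List[List[int]] = [[] for _ in ratios]
    while remaining:
        by_label = defaultdict(list)
        for i in remaining:
            for l in label_sets[i]:
                by_label[l].append(i)
        if not by_label:  # only sentences without any event remain
            for i in sorted(remaining):
                f = max(range(len(ratios)), key=lambda k: want_total[k])
                folds[f].append(i)
                want_total[f] -= 1
            break
        rare = min(by_label, key=lambda l: (len(by_label[l]), l))
        for i in sorted(by_label[rare]):
            # Prefer the fold needing this label most; break ties by overall need, then randomly.
            f = max(range(len(ratios)),
                    key=lambda k: (want_label[rare][k], want_total[k], rng.random()))
            folds[f].append(i)
            remaining.discard(i)
            want_total[f] -= 1
            for l in label_sets[i]:
                want_label[l][f] -= 1
    return folds


toy = [{"a"}] * 4 + [{"b"}] * 2
print(iterative_stratify(toy, ratios=(0.5, 0.5)))
```

Handling the rarest label first is the key design choice: frequent labels can be balanced later, whereas a rare label that is assigned carelessly may be missing from a fold entirely\.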
##### Models
For the encoder\-only baselines, we employ BanglaBERT \(Base: 110M, Large: 335M\) \(Bhattacharjee et al\., [2022](https://arxiv.org/html/2606.30914#bib.bib4)\) and XLM\-RoBERTa \(Base: 270M, Large: 550M\) \(Conneau et al\., [2020](https://arxiv.org/html/2606.30914#bib.bib5)\)\. BanglaBERT represents the state\-of\-the\-art monolingual pretrained model for Bangla NLP, while XLM\-R provides a strong multilingual baseline\.
For the generative LLM baselines, we evaluate several open\-weight LLMs across different parameter scales and multilingual support\. Specifically, we consider models from the Llama family, including Llama 3\.1 8B Instruct and Llama 3\.2 1B Instruct \(Dubey et al\., [2024](https://arxiv.org/html/2606.30914#bib.bib8)\), together with Google’s Gemma 3 4B Instruct and Gemma 3 1B Instruct \(Gemma Team et al\., [2025](https://arxiv.org/html/2606.30914#bib.bib11)\)\.
##### Model Training
For encoder\-only architectures, event detection is formulated directly as a token classification task\. For generative LLMs, event detection is reformulated as an instruction\-following task, for which we employ a targeted negative sampling \(NS\) approach\. Negative examples provide valuable contrastive supervision for distinguishing event types and suppressing hallucinated predictions \(Srivastava et al\., [2025](https://arxiv.org/html/2606.30914#bib.bib34)\)\. To assess the impact of annotation guidelines on both model performance and robustness to noisy inputs, we consider two independent prompting settings\. While both settings provide the model with the input sentence and the target event type, they differ in their schema representation\. In the w/ Guide setting, detailed event definitions are embedded directly within the prompt as Python docstrings\. In contrast, the w/o Guide setting omits these definitions entirely, requiring the model to perform event detection based solely on the event type name and its underlying parametric knowledge\. In both settings, we instruction\-tune the LLMs on an augmented training set where each sentence’s positive event queries are supplemented with exactly 5 randomly selected negative samples, yielding a 1:5 positive\-to\-negative ratio\. We use a smaller negative sampling ratio than prior work as a practical trade\-off between computational efficiency and exposure to absent event types\. Details on prompt format and training setup are described in Appendix [B\.1](https://arxiv.org/html/2606.30914#A2.SS1) and [B\.2](https://arxiv.org/html/2606.30914#A2.SS2)\.
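The construction of training queries with negative sampling can be sketched as follows\. The function name `build_queries` and the `(event_type, is_positive)` representation are assumptions for illustration; only the choice of 5 random negatives per sentence comes from the text\.

```python
import random
from typing import List, Optional, Sequence, Set, Tuple

Query = Tuple[str, bool]  # (event_type, is_positive)


def build_queries(gold_types: Set[str],
                  all_types: Sequence[str],
                  n_neg: int = 5,
                  rng: Optional[random.Random] = None) -> List[Query]:
    """Pair a sentence with its gold event types plus n_neg absent types."""
    rng = rng or random.Random(0)
    positives = [(t, True) for t in sorted(gold_types)]
    absent = [t for t in all_types if t not in gold_types]
    # Sample without replacement so no negative event type is repeated.
    negatives = [(t, False) for t in rng.sample(absent, min(n_neg, len(absent)))]
    return positives + negatives


# Hypothetical ontology of 40 placeholder type names.
ontology = [f"Type{i}" for i in range(40)]
for event_type, is_positive in build_queries({"Type3"}, ontology):
    print(event_type, "positive" if is_positive else "negative (expected empty output)")
```

For a negative query, the training target would be an empty extraction, which is what teaches the model to abstain when the queried event type is absent\.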
##### Evaluation Protocol
Following Srivastava et al\. \([2025](https://arxiv.org/html/2606.30914#bib.bib34)\), we adopt a one\-vs\-all inference strategy for generative LLMs, where each test sentence is paired with every event type schema independently, and predictions are aggregated across all 40 types\.
Encoder\-only models naturally learn to identify non\-event tokens through the O label in BIO tagging, whereas generative LLMs require explicit negative supervision to correctly abstain when the queried event type is absent\. The one\-vs\-all formulation therefore provides a comparable supervision signal across both architectural paradigms, enabling a fair comparison despite their fundamentally different output formulations\.
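A minimal sketch of one\-vs\-all inference is given below\. `query_model` stands in for a call to the instruction\-tuned LLM with a parsed output; it and the function name `one_vs_all` are hypothetical, and the stub model exists only to make the example runnable\.

```python
from typing import Callable, List, Sequence, Set, Tuple

Prediction = Tuple[str, str]  # (trigger_text, event_type)


def one_vs_all(sentence: str,
               event_types: Sequence[str],
               query_model: Callable[[str, str], List[str]]) -> Set[Prediction]:
    """Query the model once per event type and pool the predicted triggers."""
    predictions: Set[Prediction] = set()
    for event_type in event_types:
        # An empty list means the model abstained for this event type.
        for trigger in query_model(sentence, event_type):
            predictions.add((trigger, event_type))
    return predictions


def stub_model(sentence: str, event_type: str) -> List[str]:
    """Toy stand-in: fires only when a fixed keyword is present."""
    keywords = {"Attack": "exploded", "Die": "killed"}
    word = keywords.get(event_type)
    return [word] if word and word in sentence.split() else []


print(one_vs_all("A bomb exploded and two people were killed",
                 ["Attack", "Die", "Elect"], stub_model))
```

With 40 event types, this strategy issues 40 model calls per sentence, which is the main cost of the formulation\.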
##### Evaluation Metric
We select Macro\-F1 as our primary evaluation metric\. As demonstrated by Al Monsur et al\. \([2026](https://arxiv.org/html/2606.30914#bib.bib2)\), prioritizing Micro\-F1 systematically inflates perceived model performance by disproportionately favoring majority classes\. Because Macro\-F1 computes the metric independently for each class before averaging, it equally weights all categories regardless of their support size\.
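The difference between the two averages is easy to see on toy counts\. The sketch below uses invented true\-positive, false\-positive, and false\-negative counts for one frequent and one rare class; the numbers are illustrative only and are not taken from the benchmark\.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0 when the class is never predicted or present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(per_class: Dict[str, Counts]) -> Tuple[float, float]:
    """Macro-F1 averages per-class F1; Micro-F1 pools the counts first."""
    macro = sum(f1(*c) for c in per_class.values()) / len(per_class)
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return macro, f1(tp, fp, fn)


toy_counts = {
    "FrequentType": (90, 10, 10),  # per-class F1 = 0.90
    "RareType": (1, 4, 4),         # per-class F1 = 0.20
}
macro, micro = macro_micro_f1(toy_counts)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

In this toy case the pooled score is dominated by the frequent class, while the macro average exposes the weak rare class\.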
| Architecture | Model | Training Condition | Macro F1: Clean Test | Macro F1: ASR Test | Macro F1: Ortho Test \(Avg\.\) | $\Delta_{\text{ASR}}$ | $\Delta_{\text{Ortho}}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Decoder\-Only | Gemma 3 4B IT | Clean w/ Guide | 60\.57 | 59\.53 | 53\.66 | 1\.04 | 6\.91 |
| | | Clean w/o Guide | 55\.19 | 53\.94 | 50\.58 | 1\.25 | 4\.61 |
| | | Combined w/ Guide | 65\.60 | 65\.43 | 59\.14 | 0\.17 | 6\.46 |
| | Gemma 3 1B IT | Clean w/ Guide | 52\.27 | 48\.81 | 45\.11 | 3\.46 | 7\.16 |
| | | Clean w/o Guide | 48\.88 | 45\.75 | 44\.31 | 3\.13 | 4\.57 |
| | | Combined w/ Guide | 59\.07 | 54\.58 | 50\.40 | 4\.49 | 8\.67 |
| | Llama 3\.1 8B IT | Clean w/ Guide | 57\.58 | 55\.23 | 54\.57 | 1\.65 | 3\.01 |
| | | Clean w/o Guide | 55\.27 | 46\.92 | 49\.86 | 8\.35 | 5\.41 |
| | | Combined w/ Guide | 64\.95 | 66\.39 | 61\.60 | \-1\.44 | 3\.35 |
| | Llama 3\.2 1B IT | Clean w/ Guide | 54\.64 | 48\.07 | 48\.38 | 6\.57 | 6\.26 |
| | | Clean w/o Guide | 47\.03 | 43\.89 | 40\.42 | 3\.14 | 6\.61 |
| | | Combined w/ Guide | 56\.37 | 59\.33 | 51\.93 | \-2\.96 | 4\.44 |
| Encoder\-Only | BanglaBERT Base | Clean | 67\.35 | 56\.79 | 51\.35 | 8\.02 | 15\.42 |
| | | Combined | 68\.49 | 68\.58 | 52\.02 | \-0\.09 | 16\.47 |
| | BanglaBERT Large | Clean | 68\.91 | 58\.34 | 50\.25 | 10\.57 | 18\.66 |
| | | Combined | 68\.84 | 69\.38 | 52\.41 | \-0\.54 | 16\.43 |
| | XLM\-R Base | Clean | 67\.35 | 55\.46 | 58\.54 | 11\.89 | 8\.81 |
| | | Combined | 68\.29 | 66\.20 | 58\.80 | 2\.09 | 9\.49 |
| | XLM\-R Large | Clean | 70\.61 | 56\.92 | 61\.43 | 13\.69 | 9\.18 |
| | | Combined | 71\.16 | 69\.34 | 62\.33 | 1\.82 | 8\.83 |

Table 3: Macro\-F1 performance of encoder and decoder architectures across clean and noisy test sets\. $\Delta_{\mathrm{ASR}}$ and $\Delta_{\mathrm{Ortho}}$ denote the absolute performance degradation relative to the clean test set \($\text{Clean} - \text{Noisy}$\)\. Smaller $\Delta$ values indicate greater robustness, while negative values indicate improved performance on the noisy test set compared with the clean baseline\.
### 4\.2 Evaluation of Language Models
##### LLMs exhibit greater robustness to noise than encoder\-based models\.
To quantify robustness, we measure the performance degradation due to noise, denoted by $\Delta\mathrm{F1}$, as the difference between a model’s Macro F1 score on the clean test set \($\mathrm{F1}_{\text{Clean}}$\) and its average performance under noisy conditions \($\mathrm{F1}_{\text{Noisy}}$\)\. To balance the influence of real\-world ASR transcription errors and simulated orthographic noise, we define $\mathrm{F1}_{\text{Noisy}} = \frac{1}{2}(\mathrm{F1}_{\text{ASR}} + \mathrm{F1}_{\text{Ortho-Avg}})$, where $\mathrm{F1}_{\text{Ortho-Avg}}$ denotes the average Macro F1 across all word corruption rates\. The overall robustness drop is then computed as $\Delta\mathrm{F1} = \mathrm{F1}_{\text{Clean}} - \mathrm{F1}_{\text{Noisy}}$\.
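As a worked example of this definition, the sketch below computes $\Delta\mathrm{F1}$ from the Clean, ASR, and Ortho \(Avg\.\) Macro F1 scores reported in Table 3 for two models trained on clean text\. Only the helper name `robustness_drop` is introduced here\.

```python
def robustness_drop(f1_clean: float, f1_asr: float, f1_ortho_avg: float) -> float:
    """Delta F1 = F1_Clean - 0.5 * (F1_ASR + F1_Ortho-Avg)."""
    f1_noisy = 0.5 * (f1_asr + f1_ortho_avg)
    return f1_clean - f1_noisy


# (Clean, ASR, Ortho Avg.) Macro F1 scores copied from Table 3.
scores = {
    "BanglaBERT Large (Clean)": (68.91, 58.34, 50.25),
    "Llama 3.1 8B IT (Clean w/ Guide)": (57.58, 55.23, 54.57),
}
for model, (clean, asr, ortho) in scores.items():
    print(f"{model}: drop = {robustness_drop(clean, asr, ortho):.2f}")
```

The two results, roughly 14\.6 and 2\.7 points, correspond to the drops discussed for these models in the text\.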
The results, as shown in Figure [4](https://arxiv.org/html/2606.30914#S4.F4), reveal a clear distinction between encoder\-based models and generative LLMs\. Although encoder models trained exclusively on clean news text achieve stronger performance on the clean test set, they are substantially more sensitive to input corruption\. For example, BanglaBERT Large attains a clean\-text Macro F1 score of 68\.9 but suffers a severe degradation of 14\.6 points under noisy conditions\. Similarly, XLM\-R Large, which attains the highest clean\-text score \(70\.6\), experiences an average drop of 11\.4 points\. In contrast, generative LLMs exhibit considerably greater resilience\. Llama 3\.1 8B IT \(w/ Guide\) achieves a Macro F1 score of 57\.6 on the clean test set while incurring an average performance drop of only 2\.7 points under noisy conditions\. Similarly, Gemma 3 4B IT \(w/ Guide\) experiences a degradation of merely 4\.0 points\. These results indicate a trade\-off between peak performance and robustness: encoder\-based models score higher on clean text, whereas generative LLMs maintain more stable performance under both orthographic and real\-world ASR transcription noise\.
Figure 4:Decomposition of Macro\-F1 performance across model architectures and training conditions into retained performance and average noise\-induced degradation\. The total height of each bar corresponds to the Clean Test Macro\-F1 score\. The solid region represents the average Macro\-F1 retained across the ASR and simulated orthographic test sets, while the hatched region indicates the corresponding average performance drop relative to the clean test set\.
##### Instruction tuning with annotation guidelines improves performance but yields inconsistent reductions in clean\-to\-noisy performance drop
As shown in Figure [5](https://arxiv.org/html/2606.30914#S4.F5) and Table [3](https://arxiv.org/html/2606.30914#S4.T3), across all evaluated LLM families, instruction tuning with the w/ Guide approach consistently improves Macro F1 on both the clean and ASR test sets, even though the models are trained exclusively on clean news text\. For example, the w/ Guide approach increases the clean\-test Macro F1 score of Gemma 3 4B from 55\.2 to 60\.6, with similar improvements observed across other models\. However, the effect of annotation guidelines on the clean\-to\-noisy performance drop \($\Delta\mathrm{F1}$\) is inconsistent\. For several models, the w/ Guide approach slightly increases the magnitude of degradation: the average robustness drop for Gemma 3 4B widens from 2\.9 points w/o Guide to 4\.0 points w/ Guide, while for Llama 3\.2 1B it increases from 4\.9 to 6\.4 points\. In contrast, Llama 3\.1 8B benefits substantially from the w/ Guide approach, with the average degradation decreasing from 6\.9 points to only 2\.6 points\. These findings suggest that although annotation guidelines improve task understanding and overall performance, they do not reliably reduce the performance degradation caused by noise\.


Figure 5: Comparison of performance degradation between Generative LLMs \(left\) and Encoder architectures \(right\) under varying levels of simulated orthographic noise\.
Figure 6: Pairwise McNemar’s significance tests at $\mathrm{WCR}=40\%$ comparing decoder\-only LLMs and encoder\-only models on event detection\. Results are reported separately for trigger corruption \(left\) and context corruption \(right\)\. Blue indicates that the LLM significantly outperforms the encoder, red indicates the converse, and white denotes no statistically significant difference \(${}^{*}p<0.05$, ${}^{**}p<0.01$, ${}^{***}p<0.001$\)\. All models are trained on clean news text and evaluated on previously unseen orthographically corrupted inputs\.
##### Combined clean and noisy training enhances encoder robustness
Training on a mixture of clean and corrupted samples \(Clean \+ Noisy ASR\) improves robustness for both encoder\-based models and generative LLMs\. However, the magnitude of these gains differs considerably across architectures\. For example, BanglaBERT Large’s average performance drop \($\Delta\mathrm{F1}$\) shrinks dramatically from 14\.6 points to only 7\.9 points\. Similarly, the degradation of XLM\-R Large shrinks from 11\.4 points to 5\.3 points, with no notable increase in clean\-test performance for either model\. In contrast, generative LLMs benefit in both dimensions\. For instance, Gemma 3 4B \(w/ Guide\) not only improves its clean\-test performance but also reduces its average performance drop from 4\.0 to 2\.8 points, while Llama 3\.1 8B \(w/ Guide\) simultaneously achieves a higher clean\-test score and further decreases its degradation from 2\.6 to 1\.7 points\. These findings reveal a fundamental distinction between the two paradigms: noise augmentation is particularly beneficial for encoder architectures, significantly narrowing the robustness gap between encoder models and generative LLMs\.
##### Model Scaling Benefits Decoder\-Only LLMs More Than Encoders
An analysis of parameter scaling reveals a clear architectural divergence in robustness to noisy input conditions\. Within decoder\-only models, increasing model size consistently reduces the performance degradation induced by noise\. For example, scaling from Llama 3\.2 1B to Llama 3\.1 8B decreases the average noise\-induced Macro\-F1 drop from 6\.4 to 2\.7 points under clean training w/ Guide\. Similar trends are observed for the Gemma family, where the 4B model consistently incurs a smaller robustness penalty than its 1B counterpart\. In contrast, scaling encoder models from Base to Large provides only marginal improvements in clean performance while failing to consistently improve robustness\. Under clean training, BanglaBERT\-Large and XLM\-R\-Large experience larger average performance drops than their Base counterparts \(14\.6 vs\. 13\.3 and 11\.4 vs\. 10\.4 points, respectively\), despite achieving higher clean\-test performance\. These results suggest that increased model capacity substantially improves the robustness of decoder\-only architectures to noise, whereas simply scaling encoder architectures yields limited robustness gains\. We hypothesize that larger decoder\-only models better exploit semantic and contextual information to compensate for corrupted lexical forms, while larger encoder models remain comparatively sensitive to lexical perturbations despite their increased capacity\.
##### Progressive orthographic noise induces monotonic performance decline, modulated by training strategy\.
Across increasing Word Corruption Rates \(WCR\), both generative LLMs and encoder\-based architectures exhibit a monotonic decline in Macro F1 scores as simulated orthographic noise increases from 0% to 40% \(Figure[5](https://arxiv.org/html/2606.30914#S4.F5)\)\. However, the severity of this degradation varies considerably across models and training strategies\. Combined training consistently improves robustness by flattening the performance degradation curve, particularly at higher corruption levels\. This effect is most pronounced for encoder\-based models, where combined training substantially mitigates the steep decline observed in models trained exclusively on clean text\. For generative LLMs, combined training provides an additional benefit by not only improving robustness but also increasing absolute performance across most corruption levels\. Across all evaluated model families, w/ Guide variants consistently outperform their w/o Guide counterparts on both clean and corrupted inputs\.
##### LLMs Excel Under Trigger Corruption, Encoders Under Context Corruption
To complement the aggregate performance analysis, we perform paired statistical significance testing using McNemar’s exact test at WCR = 40%, the most challenging simulated orthographic corruption setting\. Focusing on the highest corruption level allows us to determine whether performance differences remain statistically significant under severe noise\. All models are trained on clean news text and evaluated on previously unseen orthographically corrupted inputs\.
Predictions are aligned at the sentence level and partitioned according to whether corruption affects the event trigger itself or only the surrounding context\. For decoder\-only LLMs, generated trigger mentions are deterministically mapped back to their corresponding token spans in the original sentence through exact string matching, after which predictions are converted into *\(event type, start index, end index\)* tuples identical to the encoder outputs\.
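The mapping from generated mentions to token spans can be sketched as follows. The paper does not specify the tokenizer or how repeated or unmatched mentions are handled, so whitespace tokenization, first\-match selection, inclusive end indices, and dropping unmatched mentions are all assumptions made here for illustration; the function names are hypothetical.

```python
def mention_to_span(tokens, mention):
    """Return (start, end) token indices (inclusive) of the first exact match, or None."""
    target = mention.split()
    n = len(target)
    if n == 0:
        return None
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == target:
            return start, start + n - 1
    return None

def to_tuples(tokens, generated):
    """Convert [(event_type, mention), ...] into (event_type, start, end) tuples."""
    out = []
    for event_type, mention in generated:
        span = mention_to_span(tokens, mention)
        if span is not None:  # assumption: mentions with no exact match yield no tuple
            out.append((event_type, *span))
    return out

tokens = "আহত শিক্ষার্থীকে উদ্ধার করা সম্ভব হয়েছে।".split()
print(to_tuples(tokens, [("Injure", "আহত")]))
```

Under exact matching, a generated mention that differs from the sentence by even one character maps to no span and therefore cannot count as correct.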
For trigger corruption, a prediction is considered successful if the model correctly identifies the corrupted trigger span and assigns the correct event type despite the trigger word itself being orthographically corrupted\. For context corruption, success requires correctly identifying the intact trigger span and its event type while remaining robust to orthographic noise introduced only in the surrounding context\.
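The partition into trigger\-corruption and context\-corruption subsets can be illustrated with a short sketch. It assumes that word\-level corruption preserves the number of tokens so that clean and corrupted sentences align one\-to\-one, and that a sentence with both a corrupted trigger and corrupted context is assigned to the trigger subset; neither detail is stated in the text, and the function name is hypothetical.

```python
def corruption_subset(clean_tokens, noisy_tokens, trigger_span):
    """Label a sentence as 'trigger', 'context', or 'uncorrupted'."""
    if len(clean_tokens) != len(noisy_tokens):
        raise ValueError("token sequences must be aligned one-to-one")
    start, end = trigger_span  # inclusive token indices of the gold trigger
    changed = [i for i, (c, n) in enumerate(zip(clean_tokens, noisy_tokens)) if c != n]
    if not changed:
        return "uncorrupted"
    if any(start <= i <= end for i in changed):
        return "trigger"  # the trigger itself was altered
    return "context"      # only surrounding words were altered

# Placeholder tokens for illustration.
print(corruption_subset(["w1", "trig", "w3"], ["w1", "trgi", "w3"], (1, 1)))
print(corruption_subset(["w1", "trig", "w3"], ["v1", "trig", "w3"], (1, 1)))
```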
For each subset, predictions are converted into binary correct/incorrect outcomes based on strict trigger identification, where a prediction is considered correct only if both the trigger boundaries and event type exactly match the gold annotation\. McNemar’s exact test is then applied to the paired contingency table\. Since McNemar’s test evaluates only discordant prediction pairs, cases where both models are simultaneously correct or simultaneously incorrect do not contribute to the test statistic\. Consequently, the reported significance reflects whether one model consistently succeeds on examples where the other fails\. Overall, the results reveal a clear distinction between robustness to trigger corruption and context corruption\. When the event trigger itself is orthographically corrupted, decoder\-only LLMs, particularly the larger models, consistently outperform encoder\-based architectures, indicating a significantly greater ability to correctly identify corrupted trigger mentions and assign the appropriate event type despite lexical perturbations\. In contrast, when corruption is confined to the surrounding context while the trigger remains intact, the performance gap narrows considerably\. Most model comparisons are no longer statistically significant, and among the significant cases, XLM\-R frequently outperforms the smaller LLMs\. This suggests that encoder\-based models remain highly effective at exploiting contextual token representations when the trigger is preserved, whereas the principal advantage of decoder\-only LLMs lies in maintaining accurate trigger identification under direct lexical corruption rather than in handling noisy contextual information alone\.
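The paired test described above can be sketched with the standard library alone. This is an illustrative implementation of the two\-sided exact McNemar test, not the authors' code; the function name and the example outcome vectors are hypothetical.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired binary outcomes.

    Returns (b, c, p_value), where b counts items only model A gets right
    and c counts items only model B gets right.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c  # concordant pairs are ignored by the test
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical outcomes: 9 items only A solves, 1 only B solves, 10 concordant.
correct_a = [True] * 9 + [False] * 1 + [True] * 5 + [False] * 5
correct_b = [False] * 9 + [True] * 1 + [True] * 5 + [False] * 5
b, c, p = mcnemar_exact(correct_a, correct_b)
print(b, c, round(p, 4))
```

The ten concordant pairs in the example leave the p\-value unchanged, which is why the test reflects only cases where one model succeeds and the other fails.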
## 5 Conclusion
In this paper, we presented the first comprehensive evaluation of language model robustness for Bangla event detection under both real\-world ASR transcription errors and simulated orthographic noise\. To facilitate this, we introduced a generalized Bangla news event ontology and released a benchmark comprising 9,979 human annotated sentences spanning clean news articles, ASR transcripts, and synthetic orthographic corruption\. Through a systematic comparison of encoder\-only and decoder\-only architectures, we demonstrate a clear architectural trade\-off: while encoder models achieve superior performance on clean text through precise trigger localization, they experience substantially greater degradation under lexical corruption\. In contrast, decoder\-only LLMs exhibit markedly stronger robustness, particularly when the event trigger itself is corrupted\.
Our analyses further show that increasing model scale substantially improves the robustness of decoder\-only LLMs, whereas scaling encoder architectures provides limited robustness gains despite improving clean\-text performance\. We also find that embedding annotation guidelines during instruction tuning establishes a higher performance baseline but yields inconsistent improvements in robustness to noisy inputs\. In contrast, combined training on clean and noisy text consistently mitigates performance degradation, with particularly large gains for encoder\-based models\. Overall, our findings highlight the importance of evaluating event detection systems beyond clean\-text benchmarks and suggest that decoder\-only LLMs, together with combined training, provide a promising direction for robust event detection in real\-world, low\-resource settings\.
## Limitations
Although we carefully designed our experiments to ensure a fair comparison between encoder\-based models and decoder\-only LLMs, our study has several limitations\.
##### Strict Span\-Based Evaluation\.
Our evaluation relies on the Strict Macro\-F1 metric, which requires an exact match between the predicted and gold event trigger spans and event types\. While this provides a rigorous assessment of trigger localization, it may disproportionately penalize generative LLMs\. Encoder models are naturally formulated as sequence\-labeling systems optimized for token\-level BIO tagging, enabling precise prediction of trigger boundaries\. In contrast, decoder\-only LLMs generate event mentions autoregressively and occasionally produce semantically correct trigger mentions with slightly different surface boundaries \(for instance, leaving out or adding a postposition or inflectional marker in Bangla\)\. Such predictions are treated as incorrect under the strict evaluation protocol even when they capture the correct event\. Consequently, the reported results should be interpreted as measuring exact trigger identification rather than semantic event understanding\.
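To make the strictness concrete, the sketch below scores predictions as *\(event type, start, end\)* tuples and averages per\-type F1. It is an illustration under assumptions, not the paper's evaluation script: the macro average here runs over event types that appear in the gold or predicted tuples, whereas the paper may average over all 40 subtypes, and the function name is hypothetical.

```python
from collections import defaultdict

def strict_macro_f1(gold, pred):
    """gold, pred: per-sentence lists of sets of (event_type, start, end) tuples."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        for t in p:
            # A prediction counts only if type and both boundaries match exactly.
            (tp if t in g else fp)[t[0]] += 1
        for t in g - p:
            fn[t[0]] += 1
    types = set(tp) | set(fp) | set(fn)
    scores = []
    for et in types:
        denom = 2 * tp[et] + fp[et] + fn[et]
        scores.append(2 * tp[et] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

gold = [{("Injure", 0, 0)}, {("Attack", 2, 2)}]
pred = [{("Injure", 0, 0)}, {("Attack", 2, 3)}]  # Attack span is one token too long
print(strict_macro_f1(gold, pred))
```

In the example, the Attack prediction has the right event type but an extra token in its span, so it counts as both a false positive and a false negative.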
##### Domain and Task Scope\.
This study is restricted to the news domain, using datasets constructed from formal news articles and their same\-domain ASR transcripts\. Consequently, the reported robustness characteristics may not generalize to general\-domain event detection, which may involve other event types and exhibit substantially different linguistic properties and noise distributions\. Furthermore, our investigation is limited to event detection, namely identifying event trigger mentions and classifying their event types, as annotating event arguments, participant roles, temporal expressions, and other event attributes would require substantial annotation effort and computational resources\. Future research should investigate whether the robustness trends observed for event detection extend to argument extraction and complete event extraction pipelines\.
## Ethical Considerations
Annotators were compensated for their work at a rate consistent with local standards for similar annotation tasks\. Prior to participation, annotators were informed about the purpose of the study and how their contributions would be used in the dataset\. All news articles and Automatic Speech Recognition \(ASR\) transcripts were collected from publicly available channels\. The dataset was compiled strictly for non\-commercial, academic research purposes under standard fair use principles\. Any named entities or individuals present in the text are public figures or subjects of public record, and no sensitive Personally Identifiable Information \(PII\) of private citizens was targeted or exposed\.
## References
- Ahn \(2006\)David Ahn\. 2006\.The stages of event extraction\.In*Proceedings of the Workshop on Annotating and Reasoning about Time and Events*, pages 1–8, Sydney, Australia\. Association for Computational Linguistics\.
- Al Monsur et al\. \(2026\)Abdullah Al Monsur, Nitesh Vamshi Bommisetty, and Gene Louis Kim\. 2026\.[Event Detection with a Context\-Aware Encoder and LoRA for Improved Performance on Long\-Tailed Classes](https://doi.org/10.18653/v1/2026.findings-eacl.314)\.In*Findings of the Association for Computational Linguistics: EACL 2026*, pages 5985–6003, Rabat, Morocco\. Association for Computational Linguistics\.
- Ali Khandokar et al\. \(2025\)Iftakhar Ali Khandokar, Abdullah All Tanvir, Md\. Saddam Hossain Mukta, and Swakkhar Shatabda\. 2025\.[Temporal, Demographic, and Geographical Analysis of Violent Events in Bangla News Media Using NLP Techniques](https://doi.org/10.1007/s44230-025-00092-8)\.*Human\-Centric Intelligent Systems*, 5\(1\):90–102\.
- Bhattacharjee et al\. \(2022\)Abhik Bhattacharjee, Tahmid Hasan, Wasi Ahmad, Kazi Samin Mubasshir, Md Saiful Islam, Anindya Iqbal, M\. Sohel Rahman, and Rifat Shahriyar\. 2022\.[BanglaBERT: Language Model Pretraining and Benchmarks for Low\-Resource Language Understanding Evaluation in Bangla](https://doi.org/10.18653/v1/2022.findings-naacl.98)\.In*Findings of the Association for Computational Linguistics: NAACL 2022*, pages 1318–1327, Seattle, United States\. Association for Computational Linguistics\.
- Conneau et al\. \(2020\)Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov\. 2020\.[Unsupervised Cross\-lingual Representation Learning at Scale](https://doi.org/10.18653/v1/2020.acl-main.747)\.In*Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 8440–8451, Online\. Association for Computational Linguistics\.
- Dave et al\. \(2021\)Bhargav Dave, Surupendu Gangopadhyay, Prasenjit Majumder, Pushpak Bhattacharya, Sudeshna Sarkar, and Sobha Lalitha Devi\. 2021\.[FIRE 2020 EDNIL Track: Event Detection from News in Indian Languages](https://doi.org/10.1145/3441501.3441516)\.In*Proceedings of the 12th Annual Meeting of the Forum for Information Retrieval Evaluation*, FIRE ’20, pages 25–28, New York, NY, USA\. Association for Computing Machinery\.
- Dey et al\. \(2021\)Noyon Dey, Md\. Sazzadur Rahman, Motahara Sabah Mredula, A\. S\. M\. Sanwar Hosen, and In\-Ho Ra\. 2021\.[Using Machine Learning to Detect Events on the Basis of Bengali and Banglish Facebook Posts](https://doi.org/10.3390/electronics10192367)\.*Electronics*, 10\(19\):2367\.
- Dubey et al\. \(2024\)Abhimanyu Dubey, Abhinav Jauhri, et al\. 2024\.The Llama 3 Herd of Models\.*arXiv preprint arXiv:2407\.21783*\.
- Ebner et al\. \(2020\)Seth Ebner, Patrick Xia, Ryan Culkin, Kyle Rawlins, and Benjamin Van Durme\. 2020\.[Multi\-Sentence Argument Linking](https://doi.org/10.18653/v1/2020.acl-main.718)\.In*Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 8057–8077, Online\. Association for Computational Linguistics\.
- Elahi et al\. \(2024\)Kazi Toufique Elahi, Tasnuva Binte Rahman, Shakil Shahriar, Samir Sarker, Md Tanvir Rouf Shawon, and G\. M\. Shahariar\. 2024\.[A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy Bangla Texts](https://doi.org/10.48550/arXiv.2401.14360)\.*Preprint*, arXiv:2401\.14360\.
- Gemma Team et al\. \(2025\)Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean\-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 193 others\. 2025\.[Gemma 3 Technical Report](https://doi.org/10.48550/ARXIV.2503.19786)\.*arXiv preprint*\.
- Hong et al\. \(2011\)Yu Hong, Jianfeng Zhang, Bin Ma, Jianmin Yao, Guodong Zhou, and Qiaoming Zhu\. 2011\.Using Cross\-Entity Inference to Improve Event Extraction\.In*Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies*, pages 1127–1136, Portland, Oregon, USA\. Association for Computational Linguistics\.
- Hossain et al\. \(2025\)Md\. Mithun Hossain, Sanjara, Md\. Shakil Hossain, Sudipto Chaki, Md\. Saifur Rahman, and A B M Shawkat Ali\. 2025\.[MaskNet: Enhancing Crime Event Detection with Feature Masking and Dynamic Attention](https://doi.org/10.1109/NCIM65934.2025.11160104)\.In*2025 2nd International Conference on Next\-Generation Computing, IoT and Machine Learning \(NCIM\)*, pages 1–6\.
- Hu et al\. \(2021\)Edward J\. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen\. 2021\.[LoRA: Low\-Rank Adaptation of Large Language Models](https://doi.org/10.48550/arXiv.2106.09685)\.*Preprint*, arXiv:2106\.09685\.
- Huang et al\. \(2024\)Kuan\-Hao Huang, I\-Hung Hsu, Tanmay Parekh, Zhiyu Xie, Zixuan Zhang, Prem Natarajan, Kai\-Wei Chang, Nanyun Peng, and Heng Ji\. 2024\.[TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction](https://doi.org/10.18653/v1/2024.findings-acl.760)\.In*Findings of the Association for Computational Linguistics: ACL 2024*, pages 12804–12825, Bangkok, Thailand\. Association for Computational Linguistics\.
- HumanSignal \(2020\)HumanSignal\. 2020\.Label Studio: Data labeling software\.Available at[https://labelstud\.io](https://labelstud.io/)\.
- Islam et al\. \(2021\)Khondoker Ittehadul Islam, Sudipta Kar, Md Saiful Islam, and Mohammad Ruhul Amin\. 2021\.[SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts](https://doi.org/10.18653/v1/2021.findings-emnlp.278)\.In*Findings of the Association for Computational Linguistics: EMNLP 2021*, pages 3265–3271, Punta Cana, Dominican Republic\. Association for Computational Linguistics\.
- Khandokar et al\. \(2020\)Iftakhar Ali Khandokar, Imtiaz Mamun, Tasmia Ishrat Alam Chadni, Zubair Ahmed Anas, and Swakkhar Shatabda\. 2020\.[Event Detection and Knowledge Mining from Unlabelled Bengali News Articles](https://doi.org/10.1109/ETCCE51779.2020.9350891)\.In*2020 Emerging Technology in Computing, Communication and Electronics \(ETCCE\)*, pages 1–6, Bangladesh\. IEEE\.
- Kim et al\. \(2009\)Jin\-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii\. 2009\.Overview of BioNLP’09 Shared Task on Event Extraction\.In*Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task*, pages 1–9, Boulder, Colorado\. Association for Computational Linguistics\.
- Le and Nguyen \(2021\)Duong Le and Thien Huu Nguyen\. 2021\.[Fine\-Grained Event Trigger Detection](https://doi.org/10.18653/v1/2021.eacl-main.237)\.In*Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume*, pages 2745–2752, Online\. Association for Computational Linguistics\.
- Leetaru and Schrodt \(2013\)Kalev Leetaru and Philip Schrodt\. 2013\.Gdelt: Global data on events, language, and tone, 1979\-2012\.In*International Studies Association Annual Conference*, San Francisco, CA\.
- Li et al\. \(2013\)Qi Li, Heng Ji, and Liang Huang\. 2013\.Joint Event Extraction via Structured Prediction with Global Features\.In*Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 73–82, Sofia, Bulgaria\. Association for Computational Linguistics\.
- Loshchilov and Hutter \(2019\)Ilya Loshchilov and Frank Hutter\. 2019\.[Decoupled Weight Decay Regularization](https://doi.org/10.48550/arXiv.1711.05101)\.*Preprint*, arXiv:1711\.05101\.
- Nguyen et al\. \(2021\)Minh Van Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh, and Thien Huu Nguyen\. 2021\.[Trankit: A Light\-Weight Transformer\-based Toolkit for Multilingual Natural Language Processing](https://doi.org/10.18653/v1/2021.eacl-demos.10)\.In*Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations*, pages 80–90, Online\. Association for Computational Linguistics\.
- Pang et al\. \(2023\)Chaoxu Pang, Yixuan Cao, Qiang Ding, and Ping Luo\. 2023\.[Guideline Learning for In\-Context Information Extraction](https://doi.org/10.18653/v1/2023.emnlp-main.950)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 15372–15389, Singapore\. Association for Computational Linguistics\.
- Patwardhan and Riloff \(2009\)Siddharth Patwardhan and Ellen Riloff\. 2009\.A Unified Model of Phrasal and Sentential Evidence for Information Extraction\.In*Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing*, pages 151–160, Singapore\. Association for Computational Linguistics\.
- Pouran Ben Veyseh et al\. \(2022\)Amir Pouran Ben Veyseh, Minh Van Nguyen, Franck Dernoncourt, and Thien Nguyen\. 2022\.[MINION: A Large\-Scale and Diverse Dataset for Multilingual Event Detection](https://doi.org/10.18653/v1/2022.naacl-main.166)\.In*Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 2286–2299, Seattle, United States\. Association for Computational Linguistics\.
- Saad et al\. \(2024\)Asif Mohammed Saad, Umme Niraj Mahi, Md\. Shahidul Salim, and Sk Imran Hossain\. 2024\.[Bangla news article dataset](https://doi.org/10.1016/j.dib.2024.110874)\.*Data in Brief*, 57:110874\.
- Sainz et al\. \(2024\)Oscar Sainz, Iker García\-Ferrero, Rodrigo Agerri, Oier Lopez de Lacalle, German Rigau, and Eneko Agirre\. 2024\.[GoLLIE: Annotation Guidelines improve Zero\-Shot Information\-Extraction](https://doi.org/10.48550/arXiv.2310.03668)\.*Preprint*, arXiv:2310\.03668\.
- Sechidis et al\. \(2011\)Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas\. 2011\.On the stratification of multi\-label data\.*Machine Learning and Knowledge Discovery in Databases*, pages 145–158\.
- Sharif et al\. \(2024\)Omar Sharif, Joseph Gatto, Madhusudan Basak, and Sarah Masud Preum\. 2024\.[Explicit, Implicit, and Scattered: Revisiting Event Extraction to Capture Complex Arguments](https://doi.org/10.18653/v1/2024.emnlp-main.673)\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 12061–12081, Miami, Florida, USA\. Association for Computational Linguistics\.
- Sifat et al\. \(2020\)Md Habibur Rahman Sifat, Chowdhury Rafeed Rahman, Mohammad Rafsan, and Md Hasibur Rahman\. 2020\.[Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern](https://doi.org/10.48550/arXiv.2003.03484)\.*Preprint*, arXiv:2003\.03484\.
- Sims et al\. \(2019\)Matthew Sims, Jong Ho Park, and David Bamman\. 2019\.[Literary Event Detection](https://doi.org/10.18653/v1/P19-1353)\.In*Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*, pages 3623–3634, Florence, Italy\. Association for Computational Linguistics\.
- Srivastava et al\. \(2025\)Saurabh Srivastava, Sweta Pati, and Ziyu Yao\. 2025\.[Instruction\-Tuning LLMs for Event Extraction with Annotation Guidelines](https://doi.org/10.18653/v1/2025.findings-acl.677)\.In*Findings of the Association for Computational Linguistics: ACL 2025*, pages 13055–13071, Vienna, Austria\. Association for Computational Linguistics\.
- Szymański and Kajdanowicz \(2017\)Piotr Szymański and Tomasz Kajdanowicz\. 2017\.A network perspective on stratification of multi\-label data\.In*Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications*, volume 74 of*Proceedings of Machine Learning Research*, pages 22–35, ECML\-PKDD, Skopje, Macedonia\. PMLR\.
- Touileb et al\. \(2024\)Samia Touileb, Jeanett Murstad, Petter Mæhlum, Lubos Steskal, Lilja Charlotte Storset, Huiling You, and Lilja Øvrelid\. 2024\.EDEN: A Dataset for Event Detection in Norwegian News\.In*Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\)*, pages 5495–5506, Torino, Italia\. ELRA and ICCL\.
- Walker et al\. \(2006\)Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda\. 2006\.[ACE 2005 Multilingual Training Corpus](https://doi.org/10.35111/MWXC-VH88)\.
- Wang et al\. \(2020\)Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, and Jie Zhou\. 2020\.[MAVEN: A Massive General Domain Event Detection Dataset](https://doi.org/10.18653/v1/2020.emnlp-main.129)\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 1652–1671, Online\. Association for Computational Linguistics\.
- Wang et al\. \(2023\)Xingyao Wang, Sha Li, and Heng Ji\. 2023\.[Code4Struct: Code Generation for Few\-Shot Event Structure Prediction](https://doi.org/10.48550/arXiv.2210.12810)\.*Preprint*, arXiv:2210\.12810\.
- Yao et al\. \(2022\)Feng Yao, Chaojun Xiao, Xiaozhi Wang, Zhiyuan Liu, Lei Hou, Cunchao Tu, Juanzi Li, Yun Liu, Weixing Shen, and Maosong Sun\. 2022\.[LEVEN: A Large\-Scale Chinese Legal Event Detection Dataset](https://doi.org/10.18653/v1/2022.findings-acl.17)\.In*Findings of the Association for Computational Linguistics: ACL 2022*, pages 183–201, Dublin, Ireland\. Association for Computational Linguistics\.
## Appendix A Data Collection and Ontology Development
### A\.1 Topic Coverage
To ensure comparable topical coverage across clean text and ASR transcripts, we curated both corpora from the same set of news domains\. For the clean text corpus, we selected articles from the BNAD \(Saad et al\., [2024](https://arxiv.org/html/2606.30914#bib.bib28)\) dataset, specifically using content from three major Bangladeshi newspapers: Dhaka Tribune Bangla, The Daily Ittefaq, and Daily Janakantha\. We focused on seven major news domains:
- রাজনীতি \(Politics\)
- অর্থনীতি \(Economy\)
- স্বাস্থ্য \(Health\)
- শিক্ষা \(Education\)
- প্রযুক্তি/টেক/বিজ্ঞান ও প্রযুক্তি \(Technology/Science\)
- বাণিজ্য \(Commerce\)
- •
For the ASR corpus, we curated publicly available Bangla news broadcasts from the official YouTube channels of Jamuna TV, Somoy TV, and Independent Television, and extracted their automatically generated Bangla speech transcripts\. To mirror the clean text corpus, we collected broadcasts from category\-specific playlists corresponding to the same seven news domains whenever available\.
### A\.2 Ontology Development
To construct a generalized Bangla news event ontology, we analyzed a large\-scale corpus of Bangladeshi news using the Global Database of Events, Language, and Tone \(GDELT\) through Google BigQuery and queried the Global Knowledge Graph \(GKG\) to extract historical event records related to Bangladesh from 2023 to 2025\. Based on this corpus analysis, we identified event phenomena that are prevalent in Bangladeshi news and introduced locally relevant event types and subtypes to complement the widely adopted ACE 2005 ontology\. This process ensured compatibility with established event types while capturing events that are underrepresented or absent in ACE 2005 but frequently occur in Bangladeshi news\. The resulting ontology comprises 40 event subtypes organized into generalized event categories\. The complete list of event types and subtypes, together with their source \(ACE 2005 or GDELT\-informed\), is presented in Table[4](https://arxiv.org/html/2606.30914#A1.T4)\.
Figure 7: Overview of the ontology construction process\. We analyze historical Bangladeshi news events from GDELT using Google BigQuery and the Global Knowledge Graph \(GKG\), adapt relevant event types, and integrate them with ACE 2005 to construct a generalized Bangla news event ontology\.

Table 4: Event types and sub types in the dataset

| Event Type | Sub Type | Source |
| --- | --- | --- |
| Contact | Meet | ACE 2005 |
| | Phone\-Write | ACE 2005 |
| Conflict | Attack | ACE 2005 |
| | Demonstrate | ACE 2005 |
| Crime | Commit\-Blue\-Collar\-Crime | GDELT |
| | Commit\-White\-Collar\-Crime | GDELT |
| Disaster | Occur\-Man\-Made\-Disaster | GDELT |
| | Occur\-Natural\-Disaster | GDELT |
| Festival | Celebrate\-Cultural\-Festival | GDELT |
| | Observe\-Religious\-Festival | GDELT |
| Governance | Approve | GDELT |
| | Ban | GDELT |
| | Decide | GDELT |
| | Gazette | GDELT |
| | Investigate | GDELT |
| | Reform | GDELT |
| | Support | GDELT |
| Health | Outbreak | GDELT |
| | Serve\-Patients | GDELT |
| Justice | Arrest\-Jail | ACE 2005 |
| | Charge\-Indict | ACE 2005 |
| | Deliver\-Verdict | ACE 2005 |
| | Sue | ACE 2005 |
| | Trial\-Hearing | ACE 2005 |
| Life | Die | ACE 2005 |
| | Injure | ACE 2005 |
| Movement | Transport | ACE 2005 |
| Personnel | Elect | ACE 2005 |
| | End\-Position | ACE 2005 |
| | Start\-Position | ACE 2005 |
| Socio\-Economy | Disrupt | GDELT |
| | Graduate | GDELT |
| | Grow | GDELT |
| | Import\-Export | GDELT |
| | Invest | GDELT |
| | Recruit | GDELT |
| | Trade | GDELT |
| Technology | Launch\-Service | GDELT |
| Transaction | Transfer\-Money | ACE 2005 |
| | Transfer\-Ownership | ACE 2005 |

Figure 8: Distribution of event instances across the 40 event subtypes in the News and ASR corpora\. Both corpora exhibit similar long\-tail distributions, reflecting our topic\-balancing effort despite the ASR corpus containing naturally occurring transcription artifacts and out\-of\-distribution vocabulary\.
## Appendix B Prompt Design and Model Training
### B\.1 Instruction Tuning Prompt Formats
The following example illustrates the structured code generation prompt under the w/ Guide setting, where fine\-grained annotation guidelines \(event definitions, trigger examples, and boundary constraints\) are embedded directly within the schema representation as Python docstrings\. English translations are provided in brackets for reference\.
Instruction Tuning Prompt \(w/ Guide\)

    # Instruction
    Extract events according to the schema.

    # Schema
    @dataclass
    class Injure(Life):
        """
        এই ইভেন্টটি তখন ঘটে যখন অনিচ্ছাকৃত/স্বেচ্ছায়/দুর্ঘটনাবশত কোন ব্যক্তি শারীরিকভাবে আঘাত পান।
        (This event occurs when a person is physically injured unintentionally/voluntarily/accidentally.)

        উদাহরণ ট্রিগার: আঘাত, ক্ষতি, আহত, অঙ্গহানি ইত্যাদি।
        (Example triggers: injure, harm, injured, dismemberment, etc.)

        সতর্কতা: (Note:)
        - আঘাতপ্রাপ্ত ব্যক্তি মারা গেলে তা এই ইভেন্টের অন্তর্ভুক্ত হবে না।
          (If the injured person dies, it will not be included in this event.)
        """
        mention: str

    # Sentence
    text = "আহত শিক্ষার্থীকে উদ্ধার করা সম্ভব হয়েছে।"
           (The injured student was successfully rescued.)

    result =

    # Output
    [
        Injure(mention="আহত") ("injured")
    ]

Instruction Tuning Prompt \(w/o Guide\)

    # Instruction
    Extract events according to the schema.

    # Schema
    @dataclass
    class Attack(Conflict):
        mention: str

    # Sentence
    text = "খুলনা বিশ্ববিদ্যালয়ে শিক্ষার্থী নির্যাতনের অভিযোগে শিক্ষার্থীর সীট বাতিল করেছে হল কর্তৃপক্ষ"
           (The hall authority has cancelled the seat of a student on allegations of student torture at Khulna University.)

    result =

    # Output
    [
        Attack(mention="নির্যাতনের"),  ("torture")
        ChargeIndict(mention="অভিযোগে")  ("allegations")
    ]
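The two prompt variants differ only in whether the annotation guideline appears as the class docstring. A minimal sketch of how such a prompt could be assembled and how a completion could be parsed back into (event type, trigger) pairs is given below. The helper names `build_prompt` and `parse_output` are hypothetical and not taken from the paper, and the sketch emits a single schema class as in the listings above; how the full set of event types is placed in the schema is not specified here.

```python
import re
from typing import List, Optional, Tuple


def build_prompt(event_type: str, parent_type: str, sentence: str,
                 guideline: Optional[str] = None) -> str:
    """Assemble a code-style event detection prompt (hypothetical helper)."""
    lines = [
        "# Instruction",
        "Extract events according to the schema.",
        "",
        "# Schema",
        "@dataclass",
        f"class {event_type}({parent_type}):",
    ]
    if guideline is not None:
        # "w/ Guide" variant: the annotation guideline becomes the docstring.
        lines.append('    """')
        lines.extend(f"    {g}" for g in guideline.splitlines())
        lines.append('    """')
    lines += [
        "    mention: str",
        "",
        "# Sentence",
        f'text = "{sentence}"',
        "",
        "result =",
    ]
    return "\n".join(lines)


# Matches constructor-style outputs such as Injure(mention="...").
MENTION_RE = re.compile(r'(\w+)\(mention="([^"]*)"\)')


def parse_output(completion: str) -> List[Tuple[str, str]]:
    """Extract (event type, trigger mention) pairs from a model completion."""
    return MENTION_RE.findall(completion)


if __name__ == "__main__":
    prompt = build_prompt(
        "Injure", "Life", "আহত শিক্ষার্থীকে উদ্ধার করা সম্ভব হয়েছে।",
        guideline="উদাহরণ ট্রিগার: আঘাত, ক্ষতি, আহত, অঙ্গহানি ইত্যাদি।",
    )
    print(prompt)
    print(parse_output('[\n    Injure(mention="আহত")\n]'))
```

Passing `guideline=None` yields the w/o Guide variant, so the same builder can serve both training conditions.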
### B\.2 Training Details
##### Encoders
Encoder\-only models are fine\-tuned using full\-parameter supervised fine\-tuning with the AdamW optimizer \(Loshchilov and Hutter, [2019](https://arxiv.org/html/2606.30914#bib.bib23)\), a learning rate of $2\times 10^{-5}$, weight decay of $0.01$, warmup over $10\%$ of training steps, and a maximum sequence length of 128 tokens\. Training is conducted for up to 15 epochs with early stopping based on validation Macro\-F1 with patience 3\.
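As a rough illustration of how these encoder settings interact, the sketch below collects them in a configuration object and simulates early stopping on validation Macro-F1. The names `EncoderConfig`, `warmup_steps`, and `stopping_epoch` are hypothetical, and the validation scores in the usage example are made up for illustration only; they are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EncoderConfig:
    # Values as stated in the text; the field names are assumptions.
    learning_rate: float = 2e-5
    weight_decay: float = 0.01
    warmup_fraction: float = 0.10
    max_seq_length: int = 128
    max_epochs: int = 15
    patience: int = 3


def warmup_steps(total_steps: int, cfg: EncoderConfig) -> int:
    """Number of optimizer steps spent in warmup."""
    return round(total_steps * cfg.warmup_fraction)


def stopping_epoch(val_macro_f1: Sequence[float], cfg: EncoderConfig) -> int:
    """Return the 1-based epoch at which training would stop.

    Training stops once validation Macro-F1 has failed to improve for
    `patience` consecutive epochs, or when `max_epochs` is reached.
    """
    best = float("-inf")
    stale = 0
    scores = list(val_macro_f1)[: cfg.max_epochs]
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= cfg.patience:
                return epoch
    return len(scores)


if __name__ == "__main__":
    cfg = EncoderConfig()
    print(warmup_steps(1000, cfg))
    # Illustrative scores only: the best epoch is 2, so training stops at 5.
    print(stopping_epoch([0.60, 0.65, 0.64, 0.65, 0.63], cfg))
```

Under this reading of patience, a tie with the best score does not count as an improvement; whether the original training code treats ties the same way is not stated.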
##### Decoders
Decoder\-only LLMs are adapted via Low\-Rank Adaptation \(Hu et al\., [2021](https://arxiv.org/html/2606.30914#bib.bib14)\) with rank $r=16$, scaling factor $\alpha=32$, and dropout $p=0$, applied to all linear projection layers\. We use a uniform rank across all model sizes following Al Monsur et al\. \([2026](https://arxiv.org/html/2606.30914#bib.bib2)\), who demonstrate that ranks above 8 provide no additional benefit for event detection tasks\. Models are trained for up to 5 epochs with a learning rate of $1\times 10^{-4}$, cosine learning rate schedule, warmup ratio of $0.05$, effective batch size of 64 via gradient accumulation, and early stopping on validation loss with patience 3\. All experiments are conducted on a single NVIDIA A100 GPU\.
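The decoder settings imply a few derived quantities that are easy to check by hand. The sketch below computes the LoRA update scale $\alpha/r$ from the standard LoRA formulation, the number of gradient accumulation steps needed to reach the effective batch size, and a cosine learning rate with linear warmup. The names `DecoderConfig`, `lora_scale`, `accumulation_steps`, and `lr_at` are hypothetical, the per-device batch size of 8 in the usage example is an assumption (the text states only the effective batch size), and the schedule assumes decay to zero, which the text does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    # Values as stated in the text; the field names are assumptions.
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.0
    learning_rate: float = 1e-4
    warmup_ratio: float = 0.05
    max_epochs: int = 5
    effective_batch_size: int = 64


def lora_scale(cfg: DecoderConfig) -> float:
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return cfg.lora_alpha / cfg.lora_rank


def accumulation_steps(cfg: DecoderConfig, per_device_batch_size: int) -> int:
    """Micro-batches accumulated per optimizer step on a single GPU."""
    if cfg.effective_batch_size % per_device_batch_size != 0:
        raise ValueError("effective batch size must be a multiple of the micro-batch size")
    return cfg.effective_batch_size // per_device_batch_size


def lr_at(step: int, total_steps: int, cfg: DecoderConfig) -> float:
    """Linear warmup followed by cosine decay to zero (assumed final value)."""
    warmup = round(total_steps * cfg.warmup_ratio)
    if step < warmup:
        return cfg.learning_rate * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * cfg.learning_rate * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    cfg = DecoderConfig()
    print(lora_scale(cfg))             # alpha / r
    print(accumulation_steps(cfg, 8))  # hypothetical micro-batch size of 8
    for step in (0, 25, 50, 525, 1000):
        print(step, lr_at(step, 1000, cfg))
```

With $r=16$ and $\alpha=32$ the update is scaled by 2, so changing the rank alone while keeping $\alpha$ fixed would also change the effective update magnitude.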