HierBias: Context-Conditioned Hierarchical Media Bias Detection with Multi-Task Type Classification
Summary
HierBias introduces a hierarchical context-conditioned model for media bias detection that leverages document context to improve sentence-level classification, achieving state-of-the-art F1 and MCC on the BABE and BASIL datasets.
Cached at: 06/26/26, 05:12 AM
# HierBias: Context-Conditioned Hierarchical Media Bias Detection with Multi-Task Type Classification
Source: [https://arxiv.org/html/2606.26100](https://arxiv.org/html/2606.26100)
###### Abstract
Media bias detection is a critical task for ensuring fair and balanced information dissemination, yet existing sentence-level approaches classify each sentence independently, ignoring inter-sentence contextual signals that human annotators naturally exploit. We present HierBias, a hierarchical context-conditioned media bias detector that formally models document context in bias prediction. We introduce the *context-conditioned bias probability* and prove theoretically that leveraging document context strictly reduces the Bayes error of sentence-level classification when inter-sentence mutual information is non-zero. A multi-task generalization bound further establishes that jointly training binary bias detection and fine-grained bias type classification improves sample efficiency on small annotated corpora. Architecturally, HierBias pairs a sentence-level RoBERTa encoder with a cross-sentence Transformer aggregator and dual output heads for binary detection and four-class type classification. Evaluated on BABE and BASIL, HierBias achieves 0.853 F1 and 0.723 MCC on BABE, surpassing the state-of-the-art bias-detector by +2.6% F1 and +4.3% MCC (McNemar’s test, $p<0.05$). Ablation experiments confirm that each theoretical component contributes independently and consistently.
## 1 Introduction
Media bias in news reporting poses a significant challenge to informed public discourse and democratic decision-making (Spinde et al., [2023](https://arxiv.org/html/2606.26100#bib.bib8)). When journalists employ loaded language, selective framing, or strategic information omission, readers’ understanding of events can be systematically distorted (Fan et al., [2019](https://arxiv.org/html/2606.26100#bib.bib16); Wessel et al., [2023](https://arxiv.org/html/2606.26100#bib.bib7)). Automated media bias detection has therefore emerged as a critical research direction in natural language processing (NLP), with potential applications ranging from newsroom quality control to reader-facing transparency tools (Wang et al., [2025](https://arxiv.org/html/2606.26100#bib.bib12)).
Despite notable advances, existing sentence-level bias detection systems share a fundamental limitation: they classify each sentence *independently*, ignoring the rich inter-sentence contextual signals that human annotators naturally exploit. For instance, a sentence that merely describes an event may carry informational bias only in the context of what the article *omits* elsewhere (Fan et al., [2019](https://arxiv.org/html/2606.26100#bib.bib16)). Similarly, the framing of one sentence can be amplified or suppressed by adjacent sentences’ tone and vocabulary choices. Recent work has confirmed that providing document-level context consistently improves bias detection performance across multiple model families (Maab et al., [2024](https://arxiv.org/html/2606.26100#bib.bib2)), yet no existing method formally models this context dependence or jointly optimizes it with fine-grained bias type classification.
The state-of-the-art bias-detector (Ghosh et al., [2025](https://arxiv.org/html/2606.26100#bib.bib1)) fine-tunes RoBERTa (Liu et al., [2019](https://arxiv.org/html/2606.26100#bib.bib17)) on the expert-annotated BABE dataset (Spinde et al., [2021](https://arxiv.org/html/2606.26100#bib.bib9)) and achieves statistically significant improvements over the domain-adaptive DA-RoBERTa baseline (Krieger and Spinde, [2022](https://arxiv.org/html/2606.26100#bib.bib10)) as measured by McNemar’s test and 5×2 cross-validation paired $t$-tests. While this work establishes a rigorous experimental standard, it operates strictly at the sentence level and does not distinguish among bias types such as loaded language, framing, informational bias, or source restriction. Annotation costs further constrain the available training data: BABE contains only 3,700 sentences, limiting generalization. Approaches based on large language model (LLM) annotation (Horych et al., [2025](https://arxiv.org/html/2606.26100#bib.bib3)) have recently shown promise in scaling bias-labeled corpora, but have not yet been combined with context-aware architectures.
In this paper, we address these gaps by proposing HierBias: a Hierarchical Context-Conditioned Media Bias Detector. HierBias makes three key contributions:
1. Formal Framework. We introduce the notion of *context-conditioned bias probability* $P(y_i^b=1\mid s_i,C_i)$ and prove theoretically (Theorem [1](https://arxiv.org/html/2606.26100#Thmtheorem1)) that leveraging the full document context strictly reduces the Bayes error relative to sentence-only classifiers, provided that inter-sentence mutual information is non-zero.
2. HierBias Architecture. We design a two-stage encoder: a sentence-level RoBERTa encoder followed by a cross-sentence Transformer aggregator, coupled with a multi-task output head that jointly predicts binary bias existence and fine-grained bias type.
3. Empirical Validation. On BABE (Spinde et al., [2021](https://arxiv.org/html/2606.26100#bib.bib9)) and BASIL (Fan et al., [2019](https://arxiv.org/html/2606.26100#bib.bib16)), HierBias outperforms bias-detector by +2.6% F1 and +4.3% MCC on BABE, with statistical significance confirmed by McNemar’s test ($p<0.05$). We further show that augmentation with LLM-annotated data (Horych et al., [2025](https://arxiv.org/html/2606.26100#bib.bib3)) and multi-task training provide complementary and additive benefits, consistent with our Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2).
## 2 Related Work
### 2.1 Media Bias Detection Datasets and Benchmarks
The field of automated media bias detection has been shaped by the availability of expert-annotated resources. BABE (Spinde et al., [2021](https://arxiv.org/html/2606.26100#bib.bib9)) provides 3,700 sentences with binary bias labels and word-level annotations, collected from diverse news outlets and topics, and remains the primary benchmark for sentence-level bias detection. BASIL (Fan et al., [2019](https://arxiv.org/html/2606.26100#bib.bib16)) annotates 300 news articles at the span level, distinguishing between informational bias (selective presentation of facts) and lexical bias (loaded word choice), providing richer structural supervision. MBIB (Wessel et al., [2023](https://arxiv.org/html/2606.26100#bib.bib7)) unifies nine media bias tasks and 22 associated datasets under a common evaluation protocol, revealing that no single architecture dominates across all bias types, and that cognitive and political biases are substantially harder to detect than hate speech or gender bias. The Media Bias Taxonomy (Spinde et al., [2023](https://arxiv.org/html/2606.26100#bib.bib8)) synthesizes 3,140 papers (2019–2022) and characterizes seventeen distinct bias forms, providing the categorical vocabulary used in our fine-grained type classification.
Recent work has expanded beyond English and static datasets. SAFARI (Azizov et al., [2024](https://arxiv.org/html/2606.26100#bib.bib6)) introduces a cross-lingual corpus for political bias and factuality detection in news media, demonstrating that multilingual pre-trained language models (MPLMs) can transfer bias knowledge across languages. The Media Bias Detector tool (Wang et al., [2025](https://arxiv.org/html/2606.26100#bib.bib12)) presents an LLM-powered pipeline for near-real-time selection and framing bias analysis, evaluated with expert journalists and general readers via CHI methods.
Despite these advances, evaluation remains predominantly on small datasets in controlled settings, and no existing benchmark directly assesses context-aware sentence-level classifiers.
### 2.2 Transformer-Based Bias Detection
The dominant approach to automated sentence-level bias detection is fine-tuning pre-trained language models on labeled corpora. DA-RoBERTa (Krieger and Spinde, [2022](https://arxiv.org/html/2606.26100#bib.bib10)) introduces domain-adaptive pre-training on media bias corpora before fine-tuning on BABE, achieving an F1 of 0.814 and becoming the de facto strong baseline. The recent bias-detector (Ghosh et al., [2025](https://arxiv.org/html/2606.26100#bib.bib1)) fine-tunes RoBERTa-base directly on BABE and establishes statistically significant improvements over DA-RoBERTa using McNemar’s test and 5×2 CV paired $t$-tests; attention-based analysis further shows the model focuses on contextually relevant rather than politically sensitive tokens.
Our work draws on and extends several lines of related research. The visual in-context learning paradigm (Zhou et al., [2024a](https://arxiv.org/html/2606.26100#bib.bib32)) has demonstrated the power of context-conditioned inference in vision-language settings, motivating analogous context exploitation in textual bias detection. Thread-of-thought reasoning (Zhou et al., [2023](https://arxiv.org/html/2606.26100#bib.bib30)) has shown that maintaining coherent contextual chains improves predictions on chaotic or multi-faceted inputs, a principle we adapt to sentence-sequence modeling. Long-context reasoning for vision-language models (Zhou et al., [2024b](https://arxiv.org/html/2606.26100#bib.bib31)) similarly highlights the benefits of rethinking how context is integrated rather than merely extended. From the NLP perspective, event-centric pre-training methods (Zhou et al., [2022b](https://arxiv.org/html/2606.26100#bib.bib28); [a](https://arxiv.org/html/2606.26100#bib.bib29)) demonstrate the value of encoding relational and contextual structure, which we operationalize through cross-sentence attention. The multi-capability generalization framework (Zhou et al., [2025a](https://arxiv.org/html/2606.26100#bib.bib34)) further supports our multi-task design, showing that training on heterogeneous objectives promotes robust shared representations.
Parallel work by Menzner and Leidner ([2024a](https://arxiv.org/html/2606.26100#bib.bib14); [b](https://arxiv.org/html/2606.26100#bib.bib4)) systematically compares large pre-trained models on sentence-level bias detection and sub-type classification, introducing a 27-class fine-grained bias taxonomy and leveraging synthetic examples to improve rare-class performance. A complementary line of work explores LLM-based prompting as an alternative to fine-tuning. Maab et al. ([2024](https://arxiv.org/html/2606.26100#bib.bib2)) demonstrate that prompt-based techniques across multiple LLM families achieve performance comparable to fine-tuned models with substantially reduced engineering overhead, and that larger models with access to richer context outperform smaller prompt-based baselines. Horych et al. ([2025](https://arxiv.org/html/2606.26100#bib.bib3)) investigate the suitability of LLMs as annotators for media bias, constructing the 48K-sentence *annolexical* dataset and training classifiers that outperform annotator LLMs by 5–9% MCC while approaching human-annotated performance on BABE and BASIL. These findings directly address the data scarcity of small expert-annotated corpora such as BABE, and motivate our use of LLM-annotated data augmentation in HierBias.
HierBias differs from all prior work in two critical ways. First, it formally models the *context dependence* of bias predictions through a hierarchical architecture, rather than treating context as an implementation detail. Second, it jointly optimizes binary detection and four-class type classification through a multi-task objective with a KL alignment regularizer, supported by a generalization bound (Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2)) that explains why multi-task learning improves sample efficiency on small corpora.
### 2.3 Political Bias in Large Language Models
Beyond news text, there is growing concern about political bias *within* LLMs themselves. Bang et al. ([2024](https://arxiv.org/html/2606.26100#bib.bib11)) propose a framework for measuring political bias in LLMs along both content and stylistic dimensions, evaluating 11 open-source models on topics such as reproductive rights and climate change. Banik et al. ([2025](https://arxiv.org/html/2606.26100#bib.bib5)) construct a manually annotated dataset and compare GPT, BERT, RoBERTa, and FLAN, finding that fine-tuned RoBERTa achieves the highest alignment with human labels, while zero-shot GPT shows strongest general agreement. These works underscore that the relationship between LLM internal biases and their ability to *detect* external media bias is non-trivial, motivating rigorous evaluation protocols that distinguish annotation artifacts from genuine detection capability.
Additional context for our work comes from research on event-centric language understanding, multilingual knowledge graph question answering (Zhou et al., [2021](https://arxiv.org/html/2606.26100#bib.bib23)), bias-aware image generation (Zhou et al., [2026a](https://arxiv.org/html/2606.26100#bib.bib35)), efficient video generation with large language models (Zhou et al., [2026b](https://arxiv.org/html/2606.26100#bib.bib24)), medical vision-language modeling (Zhou et al., [2025b](https://arxiv.org/html/2606.26100#bib.bib33)), memory-augmented state space models for domain-specific detection (Wang et al., [2024](https://arxiv.org/html/2606.26100#bib.bib27)), and biomedical imaging with residual-based language models (Lai et al., [2024](https://arxiv.org/html/2606.26100#bib.bib26)), all of which demonstrate the broader utility of context-aware, multi-task, and domain-adaptive learning paradigms that underpin HierBias.
## 3 Methodology
### 3.1 Problem Formulation
Let a news article be represented as a sequence of $n$ sentences $d=(s_1,s_2,\ldots,s_n)$, where each sentence $s_i=(t_{i,1},\ldots,t_{i,l_i})$ is a sequence of tokens. We define two prediction tasks:
Task 1 (Binary Bias Detection). For each sentence $s_i$, predict $y_i^b\in\{0,1\}$, where $y_i^b=1$ indicates a biased sentence.
Task 2 (Bias Type Classification). For each biased sentence ($y_i^b=1$), predict the bias type $y_i^t\in\mathcal{T}=\{LL,FR,IN,SR\}$, corresponding to loaded language, framing, informational bias, and source restriction.
The joint objective is:
$$\mathcal{L}=\mathcal{L}_{\mathrm{binary}}+\alpha\,\mathcal{L}_{\mathrm{type}}+\beta\,\mathcal{L}_{\mathrm{KL}}, \tag{1}$$
where $\alpha,\beta>0$ are hyperparameters. We define each term in detail in Section [3.7](https://arxiv.org/html/2606.26100#S3.SS7).
### 3.2 Notation
Let $\mathbf{h}_i\in\mathbb{R}^{d_s}$ denote the sentence representation for $s_i$ produced by the sentence encoder, $\hat{\mathbf{h}}_i\in\mathbb{R}^{d_s}$ the context-enriched representation after cross-sentence aggregation, and $C_i=d\setminus\{s_i\}$ the context of $s_i$ (all sentences in $d$ except $s_i$). All linear projection matrices ($\mathbf{W}_Q,\mathbf{W}_K,\mathbf{W}_V,\mathbf{W}_b,\mathbf{W}_t$) are learnable parameters; $\sigma(\cdot)$ denotes the sigmoid function; $H(\cdot)$ denotes Shannon entropy; $I(\cdot;\cdot)$ denotes mutual information.
### 3.3 Formal Definitions
###### Definition 1 (Context-Conditioned Bias Probability).
For sentence $s_i$ in article $d$, the *context-conditioned bias probability* is:
$$P_\theta(y_i^b=1\mid s_i,C_i)\triangleq\sigma\!\left(\mathbf{W}_b\,\hat{\mathbf{h}}_i+b_b\right), \tag{2}$$
where $\hat{\mathbf{h}}_i$ is computed by the hierarchical encoder described in Section [3.6](https://arxiv.org/html/2606.26100#S3.SS6).
###### Definition 2 (Marginal Bias Probability).
The *marginal bias probability*, which ignores document context, is:
$$P_\phi(y_i^b=1\mid s_i)\triangleq\sigma\!\left(\mathbf{W}_m\,\mathbf{h}_i+b_m\right), \tag{3}$$
where $\mathbf{h}_i$ is the RoBERTa [CLS] representation of $s_i$ in isolation.
###### Definition 3 (Context Information Gain).
For sentence $s_i$, the *context information gain* is:
$$\Delta_i\triangleq H(y_i^b\mid s_i)-H(y_i^b\mid s_i,C_i)\geq 0. \tag{4}$$
$\Delta_i>0$ whenever the document context reduces uncertainty about the bias label of $s_i$.
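To make Definition 3 concrete, the following minimal sketch computes the two conditional entropies in Eq. (4) for a small discrete joint distribution. The toy distribution, the function name `conditional_entropy`, and the reduction of a sentence and its context to single binary cues are all illustrative assumptions, not part of the paper's method.

```python
import math
from collections import defaultdict


def conditional_entropy(joint, key):
    """Return H(y | key(s, c)) in bits for a joint table {(s, c, y): probability}."""
    p_x = defaultdict(float)   # marginal of the conditioning variable
    p_xy = defaultdict(float)  # joint of conditioning variable and label
    for (s, c, y), p in joint.items():
        x = key(s, c)
        p_x[x] += p
        p_xy[(x, y)] += p
    return -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)


# Hypothetical toy joint: the label is biased (y = 1) only when the sentence
# cue s and the context cue c disagree, so s alone carries no label information.
toy_joint = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}

h_sentence_only = conditional_entropy(toy_joint, key=lambda s, c: s)      # H(y | s)
h_with_context = conditional_entropy(toy_joint, key=lambda s, c: (s, c))  # H(y | s, C)
delta = h_sentence_only - h_with_context                                  # Eq. (4)
```

In this extreme case the sentence alone leaves one full bit of uncertainty and the context removes all of it; real articles would fall somewhere between this and the zero-gain case where the label is conditionally independent of the context.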
### 3.4 Assumptions
###### Assumption 1 (Context Relevance).
For every sentence $s_i$ in article $d$, there exists at least one $s_j\neq s_i$ such that $I(y_i^b;s_j\mid s_i)>0$.
Justification. Informational and framing biases by definition span multiple sentences (Fan et al., [2019](https://arxiv.org/html/2606.26100#bib.bib16)): a sentence that omits a key fact is biased precisely because other parts of the article (or comparable articles) include it. Thus cross-sentence mutual information is non-zero in practice.
###### Assumption 2 (Label Smoothness).
Consecutive sentences exhibit non-negative bias label covariance: $\mathrm{Cov}(y_i^b,y_{i+1}^b)\geq 0$.
Justification. Biased reporting tends to cluster in thematically related paragraphs, consistent with empirical observations in BABE (Spinde et al., [2021](https://arxiv.org/html/2606.26100#bib.bib9)) where biased and unbiased sentences exhibit local clustering.
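One simple way to check Assumption 2 on labeled articles is to pool all consecutive label pairs and estimate their covariance. The sketch below assumes each article is given as a list of 0/1 sentence labels in reading order; the function name and the toy label sequences are hypothetical.

```python
def consecutive_label_covariance(articles):
    """Empirical Cov(y_i, y_{i+1}) pooled over all consecutive sentence pairs."""
    pairs = [(a, b) for labels in articles for a, b in zip(labels, labels[1:])]
    if not pairs:
        raise ValueError("need at least one article with two or more sentences")
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n


# Hypothetical label sequences: one clustered article, one alternating article.
cov_clustered = consecutive_label_covariance([[1, 1, 1, 0, 0, 0]])
cov_alternating = consecutive_label_covariance([[1, 0, 1, 0, 1, 0]])
```

Clustered labels give a positive estimate and strictly alternating labels a negative one, so a negative pooled estimate on real data would indicate that the assumption does not hold for that corpus.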
###### Assumption 3 (Task Consistency).
The optimal classifiers for both binary detection and type classification are linear functions in a shared representation space $\mathcal{H}$.
### 3.5 Theoretical Results
###### Lemma 1 (Conditional Independence Decomposition).
Under Assumption [1](https://arxiv.org/html/2606.26100#Thmassumption1), $P(y_i^b\mid s_i,C_i)\neq P(y_i^b\mid s_i)\cdot P(y_i^b\mid C_i)/P(y_i^b)$ in general; the context provides strictly additional information not captured by the marginal.
###### Proof.
By the definition of conditional independence, $P(y_i^b\mid s_i,C_i)=P(y_i^b\mid s_i)$ if and only if $y_i^b\perp C_i\mid s_i$. However, Assumption [1](https://arxiv.org/html/2606.26100#Thmassumption1) states that $I(y_i^b;s_j\mid s_i)>0$ for some $s_j\in C_i$. Since $I(y_i^b;C_i\mid s_i)\geq I(y_i^b;s_j\mid s_i)$ by the chain rule and non-negativity of mutual information, this implies $I(y_i^b;C_i\mid s_i)>0$, i.e., $y_i^b\not\perp C_i\mid s_i$. Therefore the conditional is strictly more informative than the marginal. ∎
###### Lemma 2 (Attention as Sufficient Statistic Approximation).
The context-enriched representation $\hat{\mathbf{h}}_i$ defined by self-attention:
$$\hat{\mathbf{h}}_i=\sum_{j=1}^{n}a_{ij}\,\mathbf{h}_j,\quad a_{ij}=\mathrm{softmax}_j\!\left(\frac{(\mathbf{h}_i\mathbf{W}_Q)(\mathbf{h}_j\mathbf{W}_K)^{\top}}{\sqrt{d_s}}\right), \tag{5}$$
provides a first-order approximation to the conditional expectation $\mathbb{E}[\mathbf{h}(C_i)\mid s_i]$ under the Transformer universal approximation theorem (Vaswani et al., [2017](https://arxiv.org/html/2606.26100#bib.bib19)).
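The attention-weighted average in Eq. (5) can be sketched in a few lines of plain Python. This is a single-head illustration on tiny hand-written vectors, not the 2-layer, 8-head aggregator used by HierBias; the helper names and the 2-dimensional toy inputs are assumptions made for the example.

```python
import math


def row_times_matrix(h, W):
    """Row vector h (length d) times matrix W (d x k) -> row vector (length k)."""
    return [sum(h[r] * W[r][c] for r in range(len(h))) for c in range(len(W[0]))]


def cross_sentence_attention(H, W_Q, W_K):
    """Eq. (5): h_hat_i = sum_j a_ij * h_j with scaled dot-product weights a_ij."""
    d_s = len(H[0])
    Q = [row_times_matrix(h, W_Q) for h in H]
    K = [row_times_matrix(h, W_K) for h in H]
    H_hat, A = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_s) for k in K]
        top = max(scores)                       # subtract max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]     # softmax over sentences j
        A.append(weights)
        H_hat.append([sum(w * h[c] for w, h in zip(weights, H)) for c in range(d_s)])
    return H_hat, A


# Hypothetical 3-sentence article with 2-dimensional sentence vectors.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

H_hat, A = cross_sentence_attention(H, identity, identity)
H_hat_uniform, A_uniform = cross_sentence_attention(H, zeros, identity)
```

With a zero query projection every score is equal, so the weights are uniform and each output collapses to the mean sentence vector, which is a useful sanity check on the softmax.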
###### Theorem 1 (Context Gain for Bias Detection).
Under Assumptions [1](https://arxiv.org/html/2606.26100#Thmassumption1)–[2](https://arxiv.org/html/2606.26100#Thmassumption2), for any classifier $g_\phi$ using only $s_i$, there exists a classifier $g_\theta$ using $(s_i,C_i)$ such that:
$$\mathbb{E}[\mathcal{L}_{\mathrm{binary}}(g_\theta)]\leq\mathbb{E}[\mathcal{L}_{\mathrm{binary}}(g_\phi)]-\Delta, \tag{6}$$
where $\Delta=\mathbb{E}_i[\Delta_i]>0$ is the expected context information gain (Definition [3](https://arxiv.org/html/2606.26100#Thmdefinition3)).
###### Proof.
For any probabilistic classifier $g$, Gibbs’ inequality (non-negativity of the KL divergence) implies that the expected cross-entropy is lower bounded by the conditional entropy of the label given the input:
$$\mathbb{E}[\mathcal{L}_{\mathrm{binary}}(g)]\geq H(y^b\mid\mathrm{input}(g)). \tag{7}$$
For $g_\phi$ with input $s_i$: $\mathbb{E}[\mathcal{L}_{\mathrm{binary}}(g_\phi)]\geq H(y_i^b\mid s_i)$.
For $g_\theta$ with input $(s_i,C_i)$: $\mathbb{E}[\mathcal{L}_{\mathrm{binary}}(g_\theta)]\geq H(y_i^b\mid s_i,C_i)$.
Because conditioning cannot increase entropy, $H(y_i^b\mid s_i,C_i)\leq H(y_i^b\mid s_i)$, with equality if and only if $y_i^b\perp C_i\mid s_i$. By Lemma [1](https://arxiv.org/html/2606.26100#Thmlemma1), this equality does not hold; therefore:
$$H(y_i^b\mid s_i,C_i)=H(y_i^b\mid s_i)-\Delta_i\quad\text{with }\Delta_i>0. \tag{8}$$
Taking expectations over sentences: $\Delta=\mathbb{E}_i[\Delta_i]>0$. Choosing $g_\theta$ to be the true posterior $P(y_i^b\mid s_i,C_i)$ attains the lower bound for input $(s_i,C_i)$ with equality; combining this with Eq. (8) and the lower bound for $g_\phi$ yields Eq. ([6](https://arxiv.org/html/2606.26100#S3.E6)). ∎
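The lower bound in Eq. (7) follows from the standard decomposition of expected cross-entropy into a conditional entropy and a KL term. The derivation below is a sketch in which $x$ stands for the classifier's input and $g(\cdot\mid x)$ for its predicted label distribution; both symbols are introduced here only for illustration.

```latex
\begin{aligned}
\mathbb{E}_{x,y}\!\left[-\log g(y\mid x)\right]
  &= \mathbb{E}_{x}\!\left[\sum_{y} P(y\mid x)\log\frac{P(y\mid x)}{g(y\mid x)}\right]
   + \mathbb{E}_{x}\!\left[-\sum_{y} P(y\mid x)\log P(y\mid x)\right] \\
  &= \mathbb{E}_{x}\!\left[\mathrm{KL}\!\left(P(\cdot\mid x)\,\big\|\,g(\cdot\mid x)\right)\right]
   + H(y\mid x) \\
  &\geq H(y\mid x).
\end{aligned}
```

The KL term vanishes exactly when $g(\cdot\mid x)=P(\cdot\mid x)$ almost everywhere, which is why only the true posterior attains the bound.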
###### Theorem 2 (Multi-Task Generalization Bound).
Under Assumption [3](https://arxiv.org/html/2606.26100#Thmassumption3), let $n$ denote the number of labeled training samples (not the per-article sentence count of Section 3.1) and $\mathcal{C}(\mathcal{H})$ the Rademacher complexity of the shared representation class. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$:
$$\mathcal{E}_{\mathrm{gen}}\leq\mathcal{E}_{\mathrm{train}}+O\!\left(\sqrt{\frac{\mathcal{C}(\mathcal{H})+\ln(1/\delta)}{n}}\right)+\gamma_{\mathrm{task}}, \tag{9}$$
where $\gamma_{\mathrm{task}}\geq 0$ is a task divergence term that satisfies $\gamma_{\mathrm{task}}^{\mathrm{MTL}}\leq\gamma_{\mathrm{task}}^{\mathrm{STL}}$ under Assumption [3](https://arxiv.org/html/2606.26100#Thmassumption3), i.e., multi-task learning reduces the effective task divergence relative to single-task learning.
###### Proof sketch.
By the standard PAC-learning generalization bound, $\mathcal{E}_{\mathrm{gen}}\leq\mathcal{E}_{\mathrm{train}}+O(\sqrt{\mathcal{C}/n})$ for a single task. For two tasks sharing representation $\mathcal{H}$, the joint training loss couples the two objectives. Under Assumption [3](https://arxiv.org/html/2606.26100#Thmassumption3), both optimal classifiers lie in the same linear family over $\mathcal{H}$, meaning the gradient directions of $\mathcal{L}_{\mathrm{binary}}$ and $\mathcal{L}_{\mathrm{type}}$ are aligned (their dot product is non-negative in expectation). Formally, this implies that the per-task gradient variance is reduced, lowering the effective Rademacher complexity to $\mathcal{C}(\mathcal{H})/K$ where $K\geq 1$ is the task alignment factor. The task divergence $\gamma_{\mathrm{task}}$ measures the KL distance between the optimal task-specific classifiers; the KL alignment regularizer $\mathcal{L}_{\mathrm{KL}}$ in Eq. ([1](https://arxiv.org/html/2606.26100#S3.E1)) explicitly minimizes this divergence, yielding $\gamma_{\mathrm{task}}^{\mathrm{MTL}}\leq\gamma_{\mathrm{task}}^{\mathrm{STL}}$. ∎
Remark. Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2) explains why multi-task training is particularly beneficial on small corpora like BABE ($n=3{,}700$): when $n$ is small, reducing $\gamma_{\mathrm{task}}$ has a proportionally larger impact on the overall generalization bound.
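To get a feel for how the sample-size term in Eq. (9) behaves, the sketch below evaluates the rate inside the $O(\cdot)$ with constants dropped. The capacity value is a made-up placeholder, since the paper does not report $\mathcal{C}(\mathcal{H})$, and the function name is hypothetical.

```python
import math


def complexity_term(capacity, n, delta=0.05):
    """sqrt((C(H) + ln(1/delta)) / n): the rate inside O(.) in Eq. (9), constants dropped."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt((capacity + math.log(1.0 / delta)) / n)


HYPOTHETICAL_CAPACITY = 50.0  # placeholder value, not taken from the paper

term_full = complexity_term(HYPOTHETICAL_CAPACITY, n=3700)   # all of BABE
term_tenth = complexity_term(HYPOTHETICAL_CAPACITY, n=370)   # 10% of BABE
ratio = term_tenth / term_full
```

Whatever capacity is plugged in, cutting the data to one tenth inflates this term by a factor of $\sqrt{10}$, which is the $1/\sqrt{n}$ scaling the bound predicts.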
### 3.6 HierBias Architecture
Sentence Encoder. Each sentence $s_i$ is independently encoded by a shared RoBERTa-base model (Liu et al., [2019](https://arxiv.org/html/2606.26100#bib.bib17)). The [CLS] token embedding from the final layer constitutes the sentence representation $\mathbf{h}_i\in\mathbb{R}^{768}$. Input sentences are truncated to 128 tokens.
Context Aggregator (CrossSentAttention). The sentence representations $\{\mathbf{h}_1,\ldots,\mathbf{h}_n\}$ are fed as a sequence into a 2-layer Transformer encoder with 8 attention heads. Sinusoidal positional encodings are added to preserve sentence order. The output at position $i$ yields the context-enriched representation $\hat{\mathbf{h}}_i$:
$$(\hat{\mathbf{h}}_1,\ldots,\hat{\mathbf{h}}_n)=\mathrm{TransformerEncoder}(\mathbf{h}_1,\ldots,\mathbf{h}_n). \tag{10}$$
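The sentence-order signal mentioned above can be illustrated with the standard sinusoidal table from the original Transformer formulation. The paper does not spell out its exact variant, so the base of 10000 and the interleaved sine/cosine layout used here are assumptions, as is the function name.

```python
import math


def sinusoidal_positions(n_sentences, d_model):
    """Return a table PE with PE[pos][2k] = sin(angle) and PE[pos][2k+1] = cos(angle),
    where angle = pos / 10000**(2k / d_model)."""
    if d_model % 2:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(n_sentences):
        row = []
        for k in range(d_model // 2):
            angle = pos / (10000 ** (2 * k / d_model))
            row.extend((math.sin(angle), math.cos(angle)))
        table.append(row)
    return table


# Toy sizes for readability; HierBias uses 768-dimensional sentence vectors.
pe = sinusoidal_positions(n_sentences=4, d_model=8)
```

Each row would be added element-wise to the corresponding sentence vector before the aggregator, so that two identical sentences at different positions receive different inputs.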
Multi-Task Heads. Two linear classifiers operate on $\hat{\mathbf{h}}_i$:
$$\hat{y}_i^b=\sigma(\mathbf{W}_b\,\hat{\mathbf{h}}_i+b_b),\quad\mathbf{W}_b\in\mathbb{R}^{1\times 768}, \tag{11}$$
$$\hat{\mathbf{y}}_i^t=\mathrm{softmax}(\mathbf{W}_t\,\hat{\mathbf{h}}_i+b_t),\quad\mathbf{W}_t\in\mathbb{R}^{4\times 768}. \tag{12}$$
The type head is activated only for sentences where $y_i^b=1$ during training.
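The two heads in Eqs. (11) and (12) are plain linear maps followed by a sigmoid and a softmax. The sketch below applies them to a single context-enriched vector; it uses a 3-dimensional toy vector instead of 768 dimensions, and every weight value and function name is a hypothetical placeholder.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    top = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multitask_heads(h_hat, W_b, b_b, W_t, b_t):
    """Eqs. (11)-(12): binary bias probability and 4-way type distribution."""
    logit_b = sum(w * x for w, x in zip(W_b, h_hat)) + b_b
    logits_t = [sum(w * x for w, x in zip(row, h_hat)) + b for row, b in zip(W_t, b_t)]
    return sigmoid(logit_b), softmax(logits_t)


h_hat = [0.2, -0.1, 0.4]                           # toy context-enriched vector
W_b, b_b = [1.0, 0.5, -0.3], 0.1                   # binary head (1 x d)
W_t = [[0.3, 0.0, 0.1], [0.0, 0.2, 0.0],           # type head (4 x d): LL, FR, IN, SR
       [-0.1, 0.0, 0.4], [0.0, 0.0, 0.0]]
b_t = [0.0, 0.0, 0.0, 0.0]

p_bias, p_type = multitask_heads(h_hat, W_b, b_b, W_t, b_t)
p_bias_zero, p_type_zero = multitask_heads(h_hat, [0.0] * 3, 0.0, [[0.0] * 3] * 4, b_t)
```

Because both heads read the same vector, any gradient from the type loss also reshapes the representation used for binary detection, which is the coupling that the multi-task argument relies on.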
### 3.7 Training Objective
The three loss components in Eq. ([1](https://arxiv.org/html/2606.26100#S3.E1)) are defined as:
$$\mathcal{L}_{\mathrm{binary}}=-\sum_{i}\left[y_i^b\log\hat{y}_i^b+(1-y_i^b)\log(1-\hat{y}_i^b)\right], \tag{13}$$
$$\mathcal{L}_{\mathrm{type}}=-\sum_{\{i:y_i^b=1\}}\sum_{k\in\mathcal{T}}y_{i,k}^t\log\hat{y}_{i,k}^t, \tag{14}$$
$$\mathcal{L}_{\mathrm{KL}}=\mathrm{KL}\!\left(\hat{y}^b\;\Big\|\;\mathrm{sg}\!\left(\hat{y}^{t\to b}\right)\right), \tag{15}$$
where $\hat{y}^{t\to b}$ denotes the type head’s probability summed over biased classes (converted to a binary probability), and $\mathrm{sg}(\cdot)$ is the stop-gradient operator. This regularizer (Eq. ([15](https://arxiv.org/html/2606.26100#S3.E15))) targets the task divergence term $\gamma_{\mathrm{task}}$ in the bound of Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2) by minimizing it explicitly.
Default hyperparameters: $\alpha=0.5$, $\beta=0.1$, learning rate $2\times 10^{-5}$, batch size 32, epochs 5. Hyperparameters are tuned on a held-out validation split of BABE via grid search.
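The forward computation of the joint objective in Eq. (1) can be sketched as follows. Several details are assumptions: the type-to-binary probability $\hat{y}^{t\to b}$ is passed in as a precomputed list because the text does not spell out the conversion, the KL term is summed over sentences as a Bernoulli KL, and the stop-gradient has no effect in a gradient-free sketch like this one. All function names are hypothetical.

```python
import math

EPS = 1e-12  # numerical guard against log(0)


def _clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def binary_loss(y_b, p_b):
    """Eq. (13): binary cross-entropy summed over sentences."""
    return -sum(y * math.log(_clip(p)) + (1 - y) * math.log(1.0 - _clip(p))
                for y, p in zip(y_b, p_b))


def type_loss(y_b, y_t_onehot, p_t):
    """Eq. (14): categorical cross-entropy, restricted to sentences with y_b = 1."""
    return -sum(t * math.log(_clip(q))
                for y, onehot, probs in zip(y_b, y_t_onehot, p_t) if y == 1
                for t, q in zip(onehot, probs))


def kl_alignment(p_b, p_t_as_binary):
    """Eq. (15), assumed here to be a sum of Bernoulli KL(p || q) terms."""
    total = 0.0
    for p, q in zip(p_b, p_t_as_binary):
        p, q = _clip(p), _clip(q)
        total += p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def joint_loss(y_b, p_b, y_t_onehot, p_t, p_t_as_binary, alpha=0.5, beta=0.1):
    """Eq. (1) with the default alpha and beta reported in the text."""
    return (binary_loss(y_b, p_b)
            + alpha * type_loss(y_b, y_t_onehot, p_t)
            + beta * kl_alignment(p_b, p_t_as_binary))


# Toy batch: one biased sentence (type LL) and one unbiased sentence.
y_b = [1, 0]
p_b = [0.8, 0.2]
y_t_onehot = [[1, 0, 0, 0], None]          # no type label for the unbiased sentence
p_t = [[0.7, 0.1, 0.1, 0.1], [0.25] * 4]
p_t_as_binary = [0.8, 0.2]                 # agrees with p_b, so the KL term is zero

total = joint_loss(y_b, p_b, y_t_onehot, p_t, p_t_as_binary)
```

In an autodiff framework the second argument of the KL term would be detached so that only the binary head is pulled toward the type head's implied probability.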
## 4 Experiments
### 4.1 Experimental Setup
Datasets. We evaluate on three datasets: (1) BABE (Spinde et al., [2021](https://arxiv.org/html/2606.26100#bib.bib9)) (3,700 sentences, 80/10/10 train/val/test split), which serves as the primary benchmark; (2) BASIL (Fan et al., [2019](https://arxiv.org/html/2606.26100#bib.bib16)) (300 articles, evaluated in zero-shot transfer from BABE); and (3) annolexical (Horych et al., [2025](https://arxiv.org/html/2606.26100#bib.bib3)) (48K LLM-annotated sentences), used for data augmentation experiments.
Baselines. We compare against seven baselines: (1) Logistic Regression with TF-IDF features (LR+TF-IDF); (2) BERT-base fine-tuned on BABE (Devlin et al., [2019](https://arxiv.org/html/2606.26100#bib.bib18)); (3) RoBERTa-base fine-tuned on BABE (Liu et al., [2019](https://arxiv.org/html/2606.26100#bib.bib17)); (4) DA-RoBERTa (Krieger and Spinde, [2022](https://arxiv.org/html/2606.26100#bib.bib10)); (5) bias-detector (Ghosh et al., [2025](https://arxiv.org/html/2606.26100#bib.bib1)); (6) LLM prompting (GPT-3.5 zero-shot) (Maab et al., [2024](https://arxiv.org/html/2606.26100#bib.bib2)); and (7) HierBias without the type head (ablation).
Evaluation Metrics. Following Ghosh et al. ([2025](https://arxiv.org/html/2606.26100#bib.bib1)), we report macro-averaged F1, Matthews Correlation Coefficient (MCC), Precision, and Recall. Statistical significance is assessed by McNemar’s test between HierBias and each baseline.
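For reference, the reported metrics can be computed from confusion counts with the standard library alone. The sketch below makes two assumptions that the text does not confirm: McNemar's test is taken in its continuity-corrected chi-square form, and macro F1 is the unweighted mean of the per-class F1 scores for the biased and unbiased classes. Function names and the toy labels are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; defined as 0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def macro_f1(tp, fp, fn, tn):
    """Unweighted mean of the F1 scores of the positive and negative classes."""
    def f1(hits, false_alarms, misses):
        total = 2 * hits + false_alarms + misses
        return 2 * hits / total if total else 0.0
    return 0.5 * (f1(tp, fp, fn) + f1(tn, fn, fp))


def mcnemar_p_value(b, c):
    """Continuity-corrected McNemar test.

    b: items model A classifies correctly and model B does not; c: the reverse.
    The statistic is chi-square with 1 degree of freedom, whose survival
    function equals erfc(sqrt(x / 2)).
    """
    if b + c == 0:
        return 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(chi2 / 2.0))


# Toy predictions, not data from the paper.
counts = confusion_counts([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
toy_mcc = mcc(*counts)
toy_f1 = macro_f1(*counts)
toy_p = mcnemar_p_value(25, 10)
```

Only the discordant pairs enter McNemar's test, so two models with identical accuracy can still differ significantly if they err on different sentences.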
### 4.2 Main Results
Table 1: Sentence-level binary bias detection on BABE and BASIL. Best results in bold, second-best underlined. † indicates $p<0.05$ vs. bias-detector by McNemar’s test.
Table [1](https://arxiv.org/html/2606.26100#S4.T1) shows that HierBias achieves 0.853 F1 and 0.723 MCC on BABE, surpassing bias-detector by +2.6% F1 and +4.3% MCC (both significant at $p<0.05$ by McNemar’s test). Improvements hold consistently on BASIL in the zero-shot transfer setting (+3.1% F1, +3.4% MCC), demonstrating the generalization benefit of context-aware encoding predicted by Theorem [1](https://arxiv.org/html/2606.26100#Thmtheorem1).
Table 2: Fine-grained bias type classification on the type-annotated subset of BABE.
Table [2](https://arxiv.org/html/2606.26100#S4.T2) shows that joint training (HierBias) improves macro-F1 by +2.3% over the type-only ablation, confirming that binary detection and type classification mutually benefit from the shared representation, as predicted by Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2).
### 4.3 Ablation Study
Table 3: Ablation results on BABE test set (F1 / MCC).
Table [3](https://arxiv.org/html/2606.26100#S4.T3) presents ablation results. Removing the Context Aggregator causes the largest drop (−2.2% F1), confirming the theoretical prediction of Theorem [1](https://arxiv.org/html/2606.26100#Thmtheorem1) that cross-sentence context provides non-trivial information gain $\Delta>0$. Removing the type head causes a −1.1% F1 drop, validating the multi-task benefit of Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2). The KL alignment regularizer and the LLM-annotated data augmentation each contribute independently.
### 4\.4Analysis
Figure 1: Analysis experiments validating Theorem [1](https://arxiv.org/html/2606.26100#Thmtheorem1) (left) and Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2) (right). (a) Context window size vs. F1 on BABE. Performance saturates at 8 sentences, consistent with Theorem [1](https://arxiv.org/html/2606.26100#Thmtheorem1). (b) Multi-task vs. single-task F1 under varying training data. Multi-task advantage is largest (+3.1%) at 10% data.

Figure 2: Augmentation scaling (left) and attention visualization (right). Biased sentences form visible attention clusters in high-bias articles, validating Assumption [2](https://arxiv.org/html/2606.26100#Thmassumption2). (a) Effect of LLM-annotated data augmentation on BABE F1. Gains follow $O(1/\sqrt{n})$ scaling. (b) CrossSentAttention maps for a high-bias (left) and low-bias (right) article. Dashed box marks the biased sentence cluster.

Context Window Size. Figure [1(a)](https://arxiv.org/html/2606.26100#S4.F1.sf1) shows F1 as a function of the context window size (1–16 sentences). Performance increases rapidly up to 8 sentences and then plateaus, consistent with the theoretical prediction that $\Delta_i$ saturates as context coverage approaches the full article. Beyond 16 sentences, additional context does not further improve performance.
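To make the notion of a context window concrete, the sketch below selects up to `window` sentences around a target sentence. The function name and the roughly centered, boundary-clamped windowing policy are assumptions for illustration; the text does not specify how the window is positioned relative to the target sentence.

```python
def context_window(sentences, i, window):
    """Return up to `window` consecutive sentences that include sentence i.

    The window is roughly centered on i and shifted inward at article
    boundaries so that it always has min(window, len(sentences)) items.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(sentences)
    if not 0 <= i < n:
        raise IndexError("target sentence index out of range")
    window = min(window, n)
    start = max(0, i - (window - 1) // 2)  # center on the target sentence
    start = min(start, n - window)         # clamp at the end of the article
    return sentences[start:start + window]
```

With `window=1` the function returns only the target sentence, which corresponds to context-free, sentence-level classification.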
Multi-Task vs. Single-Task under Data Scarcity. Figure [1(b)](https://arxiv.org/html/2606.26100#S4.F1.sf2) compares F1 of single-task and multi-task training as a function of training data size (10%–100% of BABE). The multi-task advantage is largest (+3.1%) at 10% data and narrows to +1.1% at 100%, consistent with Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2): when $n$ is small, reducing $\gamma_{\mathrm{task}}$ has a proportionally larger impact on the bound.
LLM Annotation Augmentation. Augmenting BABE with annolexical (Horych et al., [2025](https://arxiv.org/html/2606.26100#bib.bib3)) data yields consistent gains: 48K additional sentences produce +1.7% F1. The improvement follows an $O(1/\sqrt{n})$ curve, confirming the sample complexity prediction of Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2).
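One way to check whether measured gains follow this kind of scaling is to fit the model $\text{gain}(n) \approx a - b/\sqrt{n}$, which is linear in $x = 1/\sqrt{n}$, by ordinary least squares. The sketch below assumes that functional form and uses a hypothetical function name; the inputs would be the augmentation sizes and the corresponding F1 gains, which are not reproduced here.

```python
from math import sqrt


def fit_inverse_sqrt(ns, gains):
    """Least-squares fit of gains ~ a - b / sqrt(n); returns (a, b)."""
    if len(ns) != len(gains) or len(ns) < 2:
        raise ValueError("need at least two (n, gain) pairs")
    xs = [1.0 / sqrt(n) for n in ns]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(gains) / len(gains)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, gains))
    slope = sxy / sxx                # slope of gain with respect to 1/sqrt(n)
    a = mean_y - slope * mean_x      # asymptotic gain as n grows
    return a, -slope                 # b is the negated slope
```

A positive fitted `b` with small residuals would be consistent with the claimed scaling; the fitted `a` estimates the gain in the limit of unlimited augmentation data.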
Attention Visualization. Figure [2(b)](https://arxiv.org/html/2606.26100#S4.F2.sf2) displays CrossSentAttention heat maps for a high-bias article (Fox News) and a low-bias article (Reuters). In the high-bias article, biased sentences attend strongly to each other, forming a visible attention cluster. In the low-bias article, attention is more uniformly distributed. This qualitative pattern is consistent with Assumption [2](https://arxiv.org/html/2606.26100#Thmassumption2) (label smoothness).
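The clustering pattern described above could also be summarized numerically. The sketch below is a hypothetical diagnostic, not a measure used in the paper: given a sentence-by-sentence attention matrix and the indices of the biased sentences, it reports the average share of attention that each biased sentence places on the other biased sentences. Under perfectly uniform attention over $n$ sentences with $k$ biased ones, this share equals $(k-1)/n$, which serves as a baseline.

```python
def biased_attention_mass(attn, biased):
    """Mean share of attention from each biased sentence to other biased ones.

    attn   : square list of lists; attn[i][j] is attention from sentence i to j
    biased : indices of sentences labeled as biased
    """
    biased = sorted(set(biased))
    if len(biased) < 2:
        raise ValueError("need at least two biased sentences")
    shares = []
    for i in biased:
        row_total = sum(attn[i])  # normalize in case rows do not sum to 1
        to_biased = sum(attn[i][j] for j in biased if j != i)
        shares.append(to_biased / row_total)
    return sum(shares) / len(shares)
```

A value well above the uniform baseline would indicate the kind of attention cluster visible in the high-bias heat map.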
## 5 Conclusion
We presented HierBias, a hierarchical context-conditioned media bias detector grounded in formal theoretical principles. We formally defined context-conditioned bias probability (Definition [1](https://arxiv.org/html/2606.26100#Thmdefinition1)) and proved that leveraging document context strictly reduces the Bayes error of sentence-level bias classification (Theorem [1](https://arxiv.org/html/2606.26100#Thmtheorem1)). We further derived a multi-task generalization bound (Theorem [2](https://arxiv.org/html/2606.26100#Thmtheorem2)) showing that jointly training binary detection and fine-grained type classification improves sample efficiency on small annotated corpora. Empirically, HierBias improves F1 by +2.6% and MCC by +4.3% over the state-of-the-art bias-detector (Ghosh et al., [2025](https://arxiv.org/html/2606.26100#bib.bib1)) on BABE, with consistent gains on BASIL, in line with both theoretical predictions.

Limitations include the reliance on BABE’s relatively small set of type annotations and the restricted four-class bias taxonomy. Future work will explore richer multi-label annotation (Menzner and Leidner, [2024b](https://arxiv.org/html/2606.26100#bib.bib4)), cross-lingual transfer following Azizov et al. ([2024](https://arxiv.org/html/2606.26100#bib.bib6)), and integration with real-time bias monitoring systems (Wang et al., [2025](https://arxiv.org/html/2606.26100#bib.bib12)).
## References
- D. Azizov, Z. M. Mujahid, H. AlQuabeh, P. Nakov, and S. Liang (2024) SAFARI: cross-lingual bias and factuality detection in news media and news articles. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 12217–12231.
- Y. Bang, D. Chen, N. Lee, and P. Fung (2024) Measuring political bias in large language models: what is said and how it is said. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11142–11159.
- S. A. Banik, N. N. Rahman, T. Moiukh, and F. Sadeque (2025) Bridging human and model perspectives: a comparative analysis of political bias detection in news media using large language models. arXiv preprint arXiv:2511.14606.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186.
- L. Fan, M. White, E. Sharma, R. Su, J. J. Zhu, M. Jiang, F. Dernoncourt, and R. Huang (2019) In plain sight: media bias through the lens of factual reporting. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
- H. Ghosh, A. Mosharafa, and G. Groh (2025) To bias or not to bias: detecting bias in news with bias-detector. arXiv preprint arXiv:2505.13010.
- T. Horych, C. Mandl, T. Ruas, A. Greiner-Petter, B. Gipp, A. Aizawa, and T. Spinde (2025) The promises and pitfalls of llm annotations in dataset labeling: a case study on media bias detection. In Findings of the Association for Computational Linguistics: NAACL 2025.
- J. Krieger and T. Spinde (2022) A domain-adaptive pre-training approach for language bias detection in news. In Proceedings of the 14th ACM Web Science Conference.
- Z. Lai, J. Wu, S. Chen, Y. Zhou, and N. Hovakimyan (2024) Residual-based language models are free boosters for biomedical imaging tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5086–5096.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- I. Maab, E. Marrese-Taylor, S. Padó, and Y. Matsuo (2024) Media bias detection across families of language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4083–4098.
- T. Menzner and J. L. Leidner (2024a) Experiments in news bias detection with pre-trained neural transformers. In Proceedings of ECIR 2024, Lecture Notes in Computer Science.
- T. Menzner and J. L. Leidner (2024b) Improved models for media bias detection and subcategorization. In Proceedings of ECIR 2024, Lecture Notes in Computer Science.
- T. Spinde, S. Hinterreiter, F. Haak, T. Ruas, H. Giese, N. Meuschke, and B. Gipp (2023) The media bias taxonomy: a systematic literature review on the forms and automated detection of media bias. arXiv preprint arXiv:2312.16148.
- T. Spinde, M. Plank, J. Krieger, T. Ruas, B. Gipp, and A. Aizawa (2021) Neural media bias detection using distant supervision with babe – bias annotations by experts. In Findings of the Association for Computational Linguistics: EMNLP 2021.
- A. Vaswani, N. Shazeer, A. Parikh, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30.
- J. S. Wang, S. Haider, A. Tohidi, A. Gupta, Y. Zhang, C. Callison-Burch, D. Rothschild, and D. J. Watts (2025) Media bias detector: designing and implementing a tool for real-time selection and framing bias analysis in news coverage. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems.
- Q. Wang, H. Hu, and Y. Zhou (2024) Memorymamba: memory-augmented state space model for defect recognition. arXiv preprint arXiv:2405.03673.
- M. Wessel, T. Horych, T. Ruas, A. Aizawa, B. Gipp, and T. Spinde (2023) Introducing mbib – the first media bias identification benchmark task and dataset collection. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Y. Zhou, X. Geng, T. Shen, G. Long, and D. Jiang (2022a) Eventbert: a pre-trained model for event correlation reasoning. In Proceedings of the ACM Web Conference 2022, pp. 850–859.
- Y. Zhou, X. Geng, T. Shen, C. Tao, G. Long, J. Lou, and J. Shen (2023) Thread of thought unraveling chaotic contexts. arXiv preprint arXiv:2311.08734.
- Y. Zhou, X. Geng, T. Shen, W. Zhang, and D. Jiang (2021) Improving zero-shot cross-lingual transfer for multilingual question answering over knowledge graph. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5822–5834.
- Y. Zhou, H. Li, and J. Shen (2026a) Condition errors refinement in autoregressive image generation with diffusion loss. In The Fourteenth International Conference on Learning Representations.
- Y. Zhou, X. Li, Q. Wang, and J. Shen (2024a) Visual in-context learning for large vision-language models. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pp. 15890–15902.
- Y. Zhou, Z. Rao, J. Wan, and J. Shen (2024b) Rethinking visual dependency in long-context reasoning for large vision-language models. arXiv preprint arXiv:2410.19732.
- Y. Zhou, J. Shen, and Y. Cheng (2025a) Weak to strong generalization for large language models with multi-capabilities. In The Thirteenth International Conference on Learning Representations.
- Y. Zhou, T. Shen, X. Geng, G. Long, and D. Jiang (2022b) ClarET: pre-training a correlation-aware context-to-event transformer for event-centric generation and classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2559–2575.
- Y. Zhou, L. Song, and J. Shen (2025b) Improving medical large vision-language models with abnormal-aware feedback. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12994–13011.
- Y. Zhou, J. Zhang, G. Chen, J. Shen, and Y. Cheng (2026b) Less is more: vision representation compression for efficient video generation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence.