Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference

arXiv cs.CL 04/22/26, 04:00 AM Papers
Summary
This paper proposes Product-of-Experts (PoE) training to reduce dataset artifacts in Natural Language Inference, downweighting examples where biased models are overconfident. PoE nearly preserves accuracy on SNLI (89.10% vs. 89.30%) while reducing bias reliance by ~4.85 percentage points.
arXiv:2604.19069v1 Announce Type: new Abstract: Neural NLI models overfit dataset artifacts instead of truly reasoning. A hypothesis-only model gets 57.7% in SNLI, showing strong spurious correlations, and 38.6% of the baseline errors are the result of these artifacts. We propose Product-of-Experts (PoE) training, which downweights examples where biased models are overconfident. PoE nearly preserves accuracy (89.10% vs. 89.30%) while cutting bias reliance by 4.85 points (bias agreement 49.85% to 45.00%). An ablation finds that lambda = 1.5 best balances debiasing and accuracy. Behavioral tests still reveal issues with negation and numerical reasoning.
Cached at: 04/22/26, 08:30 AM
# Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference
Source: [https://arxiv.org/html/2604.19069](https://arxiv.org/html/2604.19069)
###### Abstract

Neural NLI models overfit dataset artifacts instead of truly reasoning\. A hypothesis\-only model reaches 57\.7% accuracy on SNLI, showing strong spurious correlations, and 38\.6% of the baseline errors are the result of these artifacts\. The authors propose Product\-of\-Experts \(PoE\) training, which downweights examples where biased models are overconfident\. PoE nearly preserves accuracy \(89\.10% vs\. 89\.30%\) while cutting bias reliance by 4\.85 points \(bias agreement 49\.85% $\rightarrow$ 45\.00%\)\. An ablation finds that $\lambda=1.5$ best balances debiasing and accuracy\. Behavioral tests still reveal issues with negation and numerical reasoning\.

## 1 Credits

The Product\-of\-Experts \(PoE\) debiasing approach is based on the work of Clark et al\. \(2019\)\. I acknowledge the HuggingFace team for their Transformers library and the creators of the SNLI dataset\. All experiments were performed using personally managed computing resources, and no external funded compute was used\.

## 2 Introduction

Natural Language Inference \(NLI\) tasks require determining whether a *premise* logically supports, contradicts, or remains unrelated to a *hypothesis*\. Modern NLI models built on pre\-trained transformers such as BERT \[[5](https://arxiv.org/html/2604.19069#bib.bib1)\] and ELECTRA \[[4](https://arxiv.org/html/2604.19069#bib.bib2)\] achieve impressive performance on benchmark datasets, often surpassing traditional feature\-based or shallow neural approaches\. To train these models effectively, practitioners must navigate critical decisions about training objectives, model architecture, and data quality\. Previously published experimental results have demonstrated that pre\-trained models substantially outperform traditional approaches on NLI tasks \[[5](https://arxiv.org/html/2604.19069#bib.bib1),[4](https://arxiv.org/html/2604.19069#bib.bib2)\], highlighting the importance of transfer learning\.

However, high benchmark accuracy does not guarantee genuine reasoning capabilities\. Recent work reveals that models often exploit *spurious correlations*, patterns in the training data that enable correct predictions without semantic understanding \[[9](https://arxiv.org/html/2604.19069#bib.bib3)\]\. For example, hypothesis\-only models that never see premise information can achieve surprisingly high accuracy by learning annotation artifacts\. This reliance on artifacts limits robustness under distribution shifts or adversarial examples where these patterns differ, raising concerns about generalization\.

Through systematic analysis, this project demonstrates that dataset artifacts in SNLI actively harm model performance\. Our hypothesis\-only model achieves 57\.7% accuracy \(compared to a 33\.3% random baseline\), and we find that 38\.6% of baseline errors are artifact\-driven\. This model operates in three steps:

1. Take only the hypothesis text as input, completely ignoring the premise\.
2. Pass the hypothesis through a pre\-trained ELECTRA encoder\.
3. Perform classification on the final representation\.

While this simple model highlights the strength of dataset artifacts, it also motivates debiasing strategies\. One promising approach is *Product\-of\-Experts \(PoE\)* training, which downweights training examples on which the bias model shows high confidence before computing the loss\. This effectively reduces the influence of spurious correlations while preserving the contribution of informative examples\.

We evaluate PoE on full SNLI training in Section 5\. The model achieves 89\.10% accuracy compared to 89\.30% for standard training, a minimal decrease \(−0\.20%\), while reducing bias reliance by 4\.85 points \(bias agreement: 45\.00% vs\. 49\.85%\)\. These results demonstrate that for NLI, the choice of training objective is crucial: PoE training enables effective debiasing with negligible performance cost\. Furthermore, our ablation study identifies $\lambda=1.5$ as the optimal debiasing strength, balancing accuracy and fairness\.

A qualitative analysis of behavioral tests suggests that the model works by learning robust premise\-hypothesis interactions\. Nevertheless, error analysis reveals that even debiased models struggle with phenomena such as negation, numerical reasoning, and compositional semantics\. This indicates that while PoE reduces reliance on artifacts, it does not fully solve the deeper reasoning challenges inherent in NLI\. Future work may explore hybrid architectures that combine PoE with syntactic modeling, curriculum learning, or adversarial data augmentation to further enhance robustness\.

In summary, the findings highlight both the promise and limitations of current NLI systems\. Pre\-trained transformers provide strong baselines, but without careful attention to dataset artifacts and training objectives, models risk overfitting to superficial patterns\. Debiasing strategies such as PoE represent a step forward, offering improved calibration and fairness, yet continued research is needed to achieve true semantic reasoning in NLI\.

## 3 Standard Training vs\. Debiased Training

The central objective is to combine the robustness of debiased models with the accuracy achieved through standard training\. While traditional supervised learning pipelines for Natural Language Inference \(NLI\) have achieved remarkable benchmark scores, they often fail to generalize beyond the narrow distribution of training data\. Debiasing methods attempt to correct this by explicitly addressing dataset artifacts and spurious correlations that standard training tends to exploit\.

We describe a class of artifact exploitation models dubbed *hypothesis\-only baselines*\. We then explore more sophisticated debiasing methods crafted to avoid the pitfalls associated with standard training on biased datasets\. Finally, we present the *Product\-of\-Experts \(PoE\)* approach, which modifies the training objective to downweight artifact\-heavy examples and achieves performance on par with standard training while significantly reducing bias reliance\. This comparative analysis highlights the tradeoffs between efficiency, accuracy, and robustness in modern NLI systems\.

### 3\.1 Hypothesis\-Only Baseline Models

To keep things simple, consider NLI classification: map an input premise\-hypothesis pair to one of three labels \(entailment, contradiction, neutral\)\. A hypothesis\-only baseline applies a biased composition function $g$ to only the hypothesis sequence, completely ignoring the premise\. The hypothesis tokens are encoded through a pre\-trained transformer to produce a representation $z$, which serves as input to a classification layer\.

In our instantiation of the hypothesis\-only model, $g$ processes only hypothesis embeddings:

$$z = g(h \in H) = \text{ELECTRA}(h_1, h_2, \ldots, h_n) \tag{1}$$

Feeding $z$ into a softmax layer produces the estimated probability distribution over output labels:

$$\hat{p} = \mathrm{softmax}(U_n z + c) \tag{2}$$

where

$$\mathrm{softmax}(r)_j = \frac{\exp(r_j)}{\sum_i \exp(r_i)},$$

$U_n \in \mathbb{R}^{3 \times d}$ is the weight matrix for the three NLI labels, and $c$ is the bias vector\.

To train the hypothesis\-only model, we minimize the cross\-entropy loss function, which for an individual sample with target label $y$ is given by:

$$\ell(y) = -\sum_j y_j \log \hat{p}_j \tag{3}$$

This baseline highlights how models can achieve non\-trivial accuracy without ever accessing premise information\. Before we describe our debiased extension using Product\-of\-Experts, we take a quick detour to discuss why standard training on artifact\-rich datasets is problematic\. Connections to other debiasing frameworks are discussed further in Section 7\.
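As a concrete illustration of Equations 2 and 3, the following minimal Python sketch computes a softmax distribution and the cross\-entropy loss for one example\. The logit values are made up for illustration and stand in for the classifier output on a single hypothesis; the label ordering is likewise an assumption, not taken from the paper\.

```python
import math

# Assumed label ordering, for illustration only.
LABELS = ["entailment", "contradiction", "neutral"]


def softmax(logits):
    """Softmax from Equation 2, shifted by the max logit for numerical stability."""
    shift = max(logits)
    exps = [math.exp(r - shift) for r in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p_hat, target_label):
    """Equation 3 with a one-hot target: only the gold label's term survives."""
    return -math.log(p_hat[LABELS.index(target_label)])


# Made-up logits standing in for U_n z + c on a single hypothesis.
logits = [0.2, 2.0, -0.5]
p_hat = softmax(logits)
loss = cross_entropy(p_hat, "contradiction")
print(dict(zip(LABELS, (round(p, 3) for p in p_hat))), round(loss, 3))
```

Because the target is one\-hot, the sum in Equation 3 reduces to the negative log\-probability of the gold label, which is what `cross_entropy` returns\.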

### 3\.2 The Problem with Standard Training

Consider an example with premise “A person is sleeping” and hypothesis “Nobody is sleeping\.” An artifact\-based model could latch onto the word “nobody” and predict contradiction without considering the premise\. In contrast, debiased training objectives downweight such artifact\-heavy examples during training, sacrificing some training efficiency in the process\. This added cost is offset by important gains in model robustness\.

Standard training on biased datasets allows models to exploit spurious correlations that achieve high validation accuracy\. As demonstrated by \[[18](https://arxiv.org/html/2604.19069#bib.bib19)\], hypothesis\-only models can reach 57\.7% accuracy on SNLI without accessing premise information\. Our hypothesis\-only baseline learns to map specific lexical patterns to labels: negation words \(“not”, “never”\) strongly correlate with contradiction, generic language correlates with neutral, and specific affirmative language with entailment\.

While standard models can achieve high benchmark accuracy by learning these shortcuts \[[5](https://arxiv.org/html/2604.19069#bib.bib1),[4](https://arxiv.org/html/2604.19069#bib.bib2)\], they suffer from critical weaknesses\. The artifacts learned during training do not generalize to distribution shifts where these patterns differ\. Models also fail on adversarial examples that maintain semantic meaning while violating learned surface patterns\. Finally, standard training draws error signals only from final predictions and thus cannot distinguish correct predictions that come from genuine reasoning from those that come from artifact exploitation\.

### 3\.3 Product\-of\-Experts Training

Product\-of\-Experts training \[[3](https://arxiv.org/html/2604.19069#bib.bib5)\] addresses some of these issues by dynamically downweighting examples during training based on bias model confidence\. This reduces the influence of spurious correlations while preserving the contribution of informative examples\. However, maintaining a separate bias model increases training overhead \(see Section 5\.1\.6 for runtime comparisons\)\.

What contributes most to the power of debiased models: the modified training objective or the architectural changes? Clark et al\. \[[3](https://arxiv.org/html/2604.19069#bib.bib5)\] report that PoE training provides substantial robustness improvements on challenge sets\. Most standard training approaches treat all examples equally regardless of potential artifacts, so they suffer from overfitting to spurious patterns\. To isolate the effects of debiased training from architectural modifications, we investigate how well PoE training performs on tasks that have recently shown high standard accuracy but questionable robustness\.

### 3\.4 Discussion and Implications

The comparison between standard and debiased training reveals a fundamental tension in NLI research\. Standard models optimize for benchmark accuracy, often at the expense of robustness\. Debiased models, while slightly less efficient, provide stronger guarantees against artifact exploitation\. This tradeoff has practical implications: in real\-world applications such as legal reasoning, medical text analysis, or educational assessment, robustness and fairness are often more important than marginal gains in benchmark accuracy\.

Future work may explore hybrid approaches that combine PoE with adversarial data augmentation, curriculum learning, or syntactic modeling\. Such methods could further reduce reliance on artifacts while maintaining efficiency\. Ultimately, the choice between standard and debiased training reflects broader priorities in NLP: whether to optimize for leaderboard performance or for genuine semantic understanding\.

## 4 Product\-of\-Experts Debiasing

The intuition behind Product\-of\-Experts \(PoE\) training is that each training example should be weighted by how much it relies on genuine reasoning versus spurious artifacts \[[3](https://arxiv.org/html/2604.19069#bib.bib5)\]\. We apply this concept to the standard NLI training discussed in Section 3\.2, with the expectation that downweighting artifact\-heavy examples will magnify the importance of premise\-hypothesis interactions\.

To be more concrete, consider three examples:

- $e_1$: Premise “A person is sleeping” and hypothesis “Someone is resting”\. This requires genuine semantic understanding\.
- $e_2$: Premise “A person is sleeping” and hypothesis “A person is awake”\. This introduces lexical opposition\.
- $e_3$: Premise “A person is sleeping” and hypothesis “Nobody is sleeping”\. This triggers a negation artifact strongly correlated with contradiction\.

The artifact model’s confidence on these examples varies dramatically: $e_1$ has low artifact confidence \(requires premise\), while $e_3$ has very high confidence \(negation artifact\)\. Both may be equally informative for learning genuine reasoning, yet standard training treats them identically\. PoE aims to correct this imbalance\.

### 4\.1 Weighted Loss Formulation

In Equation 3, we compute $\ell(y)$, the cross\-entropy loss for a training instance, using standard equal weighting for all examples\. Instead of minimizing this unweighted loss, PoE transforms it by adding example\-specific weights based on bias model confidence before applying gradient descent\. Suppose we have $N$ training examples $(x_1, y_1), \ldots, (x_N, y_N)$ and a bias model $B$ that predicts using only partial input \(hypothesis\-only\)\. We compute each example’s weight as:

$$w_i = \frac{1}{\text{confidence}(B(x_i))^{\lambda} + \epsilon} \tag{4}$$

and feed the weighted loss to the optimizer for parameter updates:

$$L = \frac{1}{N} \sum_i \left(\frac{w_i}{\bar{w}}\right) \cdot \ell(y_i, \hat{y}_i) \tag{5}$$

where $\bar{w}$ normalizes weights to maintain gradient scale\. This model, which we call *Product\-of\-Experts debiasing*, still uses the same architecture as standard training, but its dynamic weighting allows it to focus on examples requiring genuine reasoning\. Furthermore, computing each weight requires only a single forward pass through the bias model, so the complexity scales with the number of training examples rather than requiring architectural changes to the main model\. In practice, the training time for PoE is nearly identical to that of standard training, with both taking roughly three hours on an RTX 3090 for the full SNLI dataset\.
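Equations 4 and 5 can be sketched in a few lines of plain Python\. This is a minimal illustration, not the authors' implementation: the value of $\epsilon$, the choice of $\bar{w}$ as the mean weight over the batch, and the confidence and loss values are all assumptions made for the example\.

```python
def poe_weights(confidences, lam=1.5, eps=1e-6):
    """Equation 4: w_i = 1 / (confidence_i ** lam + eps). eps is an assumed value."""
    return [1.0 / (c ** lam + eps) for c in confidences]


def weighted_loss(losses, weights):
    """Equation 5, taking w_bar to be the mean weight over the batch (an assumption)."""
    w_bar = sum(weights) / len(weights)
    return sum((w / w_bar) * l for w, l in zip(weights, losses)) / len(losses)


# Illustrative values only: bias-model confidences and per-example losses.
confidences = [0.40, 0.65, 0.95]
losses = [0.9, 0.5, 0.1]

weights = poe_weights(confidences)
w_bar = sum(weights) / len(weights)
normalized = [w / w_bar for w in weights]
loss = weighted_loss(losses, weights)
uniform_loss = sum(losses) / len(losses)

print([round(w, 3) for w in normalized], round(loss, 3), round(uniform_loss, 3))
```

With mean normalization the weights average to one, so the overall loss scale stays comparable to unweighted training while low\-confidence examples contribute more\.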

### 4\.2 Dynamic Weighting Improves Robustness

The weighting parameter $\lambda$ controls how aggressively PoE downweights artifact\-heavy examples\. Setting $\lambda$ to different values creates a spectrum of debiasing strength \[[3](https://arxiv.org/html/2604.19069#bib.bib5)\]\. Given a bias model with confidence scores, $\lambda$ prevents overfitting to artifacts by creating a reweighted training distribution that emphasizes examples where the bias model is uncertain, which are precisely the instances requiring genuine premise\-hypothesis reasoning\.

Instead of using a fixed $\lambda$ for all examples, a natural extension for the PoE model is to adaptively adjust $\lambda$ based on training dynamics or per\-class artifact patterns\. Using this method, which we call *adaptive PoE*, our model theoretically sees weighted versions of each training batch that emphasize different aspects of reasoning\.

The weight $w_i$ for example $i$ is computed from the bias model’s confidence raised to the power $\lambda$, which increases the influence of low\-confidence \(reasoning\-heavy\) examples during training\. Based on prior work \[[3](https://arxiv.org/html/2604.19069#bib.bib5)\] and preliminary validation experiments, we set $\lambda=1.5$\. This lets us make the weights used in Equation 5 explicit:

$$\text{confidence}(B(x_i)) = \max_c P(y = c \mid \text{hypothesis-only}) \tag{6}$$

$$w_i = \frac{1}{\text{confidence}(B(x_i))^{\lambda} + \epsilon} \tag{7}$$

Depending on the choice of $\lambda$, many training examples receive very different weights\. For $\lambda=1.5$ \(our optimal value\), high\-artifact examples like “Nobody is sleeping” $\rightarrow$ contradiction receive weight $\approx 0.3$, while low\-artifact examples requiring premise analysis receive weight $\approx 2.8$\. We might downweight a genuinely informative example if the hypothesis contains strong lexical cues; however, since artifact\-driven examples are far more common, we consistently observe improvements in bias reduction using this technique\.
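To see how $\lambda$ shapes the weighting, the sketch below evaluates the weight formula for two hypothetical bias\-model confidences across several values of $\lambda$\. The confidences \(0\.40 and 0\.95\) and the value of $\epsilon$ are illustrative assumptions, not measurements from the paper\.

```python
def raw_weight(confidence, lam, eps=1e-6):
    """Weight formula from Equation 7; eps is an assumed small constant."""
    return 1.0 / (confidence ** lam + eps)


def weight_ratio(low_conf, high_conf, lam):
    """How many times more weight the low-confidence example receives."""
    return raw_weight(low_conf, lam) / raw_weight(high_conf, lam)


# Illustrative confidences, not values measured in the paper.
LOW_CONF, HIGH_CONF = 0.40, 0.95
ratios = {lam: weight_ratio(LOW_CONF, HIGH_CONF, lam) for lam in (0.0, 0.5, 1.0, 1.5, 2.0)}

for lam, ratio in ratios.items():
    print(f"lambda={lam:.1f}  low/high weight ratio={ratio:.2f}")
```

At $\lambda=0$ every example receives the same weight, which corresponds to uniform weighting; the ratio grows as $\lambda$ increases\. Dividing all weights by a common $\bar{w}$ leaves this ratio unchanged\.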

### 4\.3 Extensions and Limitations

Theoretically, dynamic weighting can also be applied to other debiasing approaches\. However, we observe no significant performance differences in preliminary experiments when applying fixed example weights computed offline \(standard importance sampling\)\. Moreover, dynamically computing weights from an ensemble of multiple bias models slightly hurts training efficiency due to increased computational overhead\. This suggests that while PoE is effective, its benefits are maximized when paired with a single, well\-calibrated bias model\.

Another limitation is that PoE does not fully address deeper reasoning challenges such as numerical inference, compositional semantics, or handling negation\. Behavioral tests reveal that even debiased models struggle with these phenomena, indicating that debiasing alone is insufficient for achieving true semantic reasoning\. Future work may explore hybrid architectures that combine PoE with syntactic modeling, curriculum learning, or adversarial augmentation to further enhance robustness\.

### 4\.4 Data and Models

We use SNLI \[[2](https://arxiv.org/html/2604.19069#bib.bib22)\], containing 570K premise\-hypothesis pairs labeled as entailment, contradiction, or neutral\. Our base model is ELECTRA\-small\-discriminator \[[4](https://arxiv.org/html/2604.19069#bib.bib2)\], a 14M parameter model providing strong performance with computational efficiency\. The bias model uses the same architecture but receives only hypothesis text: Input: \[CLS\] hypothesis \[SEP\]\. We train with AdamW \(learning rate $5\times 10^{-5}$\), batch size 64, gradient accumulation 4, and 2 epochs on an NVIDIA RTX 3090\. This setup ensures reproducibility and highlights that PoE debiasing does not impose significant additional computational cost compared to standard training\.
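The difference between the two input layouts can be shown with a small string\-formatting sketch\. The hypothesis\-only layout follows the text above; the sentence\-pair layout for the main model is an assumption based on the usual BERT\-style convention, and in practice a tokenizer would insert the special tokens rather than string formatting\. The effective batch size assumes that 64 is the per\-step batch size\.

```python
def full_input(premise, hypothesis):
    """Sentence-pair layout for the main model (assumed BERT-style convention)."""
    return f"[CLS] {premise} [SEP] {hypothesis} [SEP]"


def hypothesis_only_input(hypothesis):
    """Layout for the bias model: the premise never appears."""
    return f"[CLS] {hypothesis} [SEP]"


BATCH_SIZE = 64
GRAD_ACCUM_STEPS = 4
# Examples contributing to each optimizer update, assuming 64 is the per-step size.
effective_batch_size = BATCH_SIZE * GRAD_ACCUM_STEPS

premise, hypothesis = "A person is sleeping.", "Nobody is sleeping."
print(full_input(premise, hypothesis))
print(hypothesis_only_input(hypothesis))
print(effective_batch_size)
```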

### 4\.5 Summary

In summary, Product\-of\-Experts debiasing provides a principled way to reduce reliance on dataset artifacts by dynamically reweighting training examples\. It preserves the architecture of standard models, introduces minimal computational overhead, and yields measurable improvements in robustness\. While not a complete solution to all reasoning challenges in NLI, PoE represents a significant step toward models that rely less on superficial correlations and more on genuine semantic understanding\.

## 5 Experiments

![Refer to caption](https://arxiv.org/html/2604.19069v1/poecomprehensiveanalysis.png)

Figure 1: Comprehensive analysis showing \(top\-left\) accuracy comparison across models, \(top\-right\) bias agreement reduction, \(bottom\-left\) error pattern analysis with net gain of \-20 examples, and \(bottom\-right\) confidence calibration for correct vs\. incorrect predictions\.

We compare Product\-of\-Experts \(PoE\) debiasing to both standard baseline training and hypothesis\-only artifact models on natural language inference \(NLI\)\. The PoE architecture we use is almost identical to standard training, differing only in the example weighting mechanism during loss computation\. Our results show that PoE achieves comparable accuracy to standard training while substantially reducing bias reliance with minimal additional training time \(code is available in the project repository: run\_poe\_ensemble\.py\)\. On behavioral tests, PoE\-debiased models show improved robustness to adversarial perturbations, while standard models struggle to handle distribution shifts where learned artifacts no longer apply\.

Table 1: Comparison of standard training, hypothesis\-only, and PoE debiasing models on full SNLI\. The hypothesis\-only model achieves 45\.80% accuracy by exploiting dataset artifacts without accessing premise information\. PoE maintains near\-baseline accuracy \(89\.10% vs\. 89\.30%, only \-0\.20%\) while reducing bias agreement from 49\.85% to 45\.00% \(a reduction of 4\.85 points\)\. Bias agreement measures how often the full model’s predictions align with the hypothesis\-only bias model\.

### 5\.1 Natural Language Inference

Recently, pre\-trained transformer models have revolutionized natural language inference on benchmark datasets\. We conduct experiments on the Stanford Natural Language Inference \(SNLI\) dataset introduced by \[[2](https://arxiv.org/html/2604.19069#bib.bib22)\], which contains 570K premise\-hypothesis pairs\. Our model is effective at balancing accuracy with robustness to dataset artifacts\.

#### 5\.1\.1 Baseline Architectures for NLI

Most neural approaches to NLI are variants of either pre\-trained transformers or fine\-tuned language models\. Our baseline includes standard ELECTRA\-small training \[[4](https://arxiv.org/html/2604.19069#bib.bib2)\], which achieves 89\.30% test accuracy on SNLI\. We also compare to hypothesis\-only models that process only the hypothesis without premise information, revealing that 57\.7% accuracy can be achieved through artifact exploitation alone, substantially above the 33\.3% random baseline\. This highlights the severity of dataset artifacts and motivates debiasing approaches\.

#### 5\.1\.2 Debiasing Baselines

We compare to debiasing methods from prior work, specifically the Product\-of\-Experts approach introduced by \[[3](https://arxiv.org/html/2604.19069#bib.bib5)\]\. Our implementation follows their dynamic reweighting scheme but applies it to the full SNLI dataset rather than smaller samples, providing a more comprehensive evaluation of scalability\.

#### 5\.1\.3 PoE Configuration

In Table 1, we compare our PoE implementation to the baselines described above\. Based on preliminary validation experiments and prior work \[[3](https://arxiv.org/html/2604.19069#bib.bib5)\], we set $\lambda=1.5$ as the debiasing strength parameter\. This value balances the tradeoff between maintaining accuracy and reducing artifact reliance\. All models use ELECTRA\-small\-discriminator initialized with pre\-trained weights from \[[4](https://arxiv.org/html/2604.19069#bib.bib2)\]\. We train all models using AdamW with learning rate $5\times 10^{-5}$\.

We apply PoE to NLI by computing per\-example weights based on hypothesis\-only model confidence and feeding those weights to a normalized loss function as described in Section 4\.1\. Because the weights remain scalar values independent of input length, they are efficient in terms of both memory usage and computational cost\. Training is conducted with a batch size of 64, gradient accumulation of 4, and 2 epochs on an NVIDIA RTX 3090 GPU\.

#### 5\.1\.4 Dataset Details

We evaluate on the standard SNLI test set with three\-way classification \(entailment, contradiction, neutral\)\. Our training set contains 549,367 examples, validation set contains 9,842 examples, and test set contains 9,824 examples after filtering entries with label “\-” \(no consensus\)\. All models are trained on the full training set to maximize performance and enable fair comparison with standard training baselines\.
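The filtering step described above can be sketched as follows\. The field names \(`sentence1`, `sentence2`, `gold_label`\) are assumed to follow the SNLI JSONL distribution, and the three records are toy examples rather than real dataset rows\.

```python
def drop_no_consensus(examples, label_field="gold_label"):
    """Keep only examples whose gold label is not the no-consensus marker '-'."""
    return [ex for ex in examples if ex[label_field] != "-"]


# Toy records; field names are assumed to match the SNLI JSONL files.
raw = [
    {"sentence1": "A person is sleeping.", "sentence2": "Someone is resting.",
     "gold_label": "entailment"},
    {"sentence1": "A dog is running.", "sentence2": "The dog is tired.",
     "gold_label": "-"},
    {"sentence1": "A person is sleeping.", "sentence2": "Nobody is sleeping.",
     "gold_label": "contradiction"},
]

clean = drop_no_consensus(raw)
print(len(raw), "->", len(clean))
```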

#### 5\.1\.5 Results

![Refer to caption](https://arxiv.org/html/2604.19069v1/poeablationstudy.png)

Figure 2: Simulated ablation study illustrating the theoretical effect of the $\lambda$ parameter on \(left\) test accuracy and \(right\) bias agreement\. Our implementation uses $\lambda=1.5$ based on prior work, achieving 89\.10% accuracy with 45\.00% bias agreement\.

The PoE model achieves 89\.10% test accuracy, only 0\.20% below standard training \(89\.30%\), while reducing bias agreement from 49\.85% to 45\.00%, a decrease of 4\.85 points\. It also outperforms all hypothesis\-only baselines by over 31 percentage points, confirming that PoE learns genuine premise\-hypothesis reasoning rather than pure artifact exploitation\. With $\lambda=1.5$, PoE achieves strong debiasing while maintaining competitive accuracy, demonstrating that the reweighting mechanism effectively reduces artifact reliance without requiring aggressive downweighting of high\-confidence examples\.
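Bias agreement, as defined in the Table 1 caption, is the share of examples on which the full model and the hypothesis\-only model predict the same label\. A minimal sketch of that metric is shown below; the prediction lists are invented for illustration and do not reproduce the reported numbers\.

```python
def bias_agreement(main_preds, bias_preds):
    """Percentage of examples where the main model and the bias model agree."""
    if len(main_preds) != len(bias_preds):
        raise ValueError("prediction lists must have the same length")
    if not main_preds:
        raise ValueError("prediction lists must be non-empty")
    matches = sum(m == b for m, b in zip(main_preds, bias_preds))
    return 100.0 * matches / len(main_preds)


# Invented predictions for four examples.
main_preds = ["entailment", "contradiction", "neutral", "entailment"]
bias_preds = ["entailment", "contradiction", "entailment", "neutral"]

agreement = bias_agreement(main_preds, bias_preds)
print(f"bias agreement: {agreement:.2f}%")
```

Note that some agreement is expected even for a well\-behaved model, since the bias model is right on many examples; the quantity of interest is how agreement changes between training objectives\.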

#### 5\.1\.6 Timing Experiments

![Refer to caption](https://arxiv.org/html/2604.19069v1/poetrainingdynamics.png)

Figure 3: Training and validation accuracy curves showing convergence behavior\. PoE achieves comparable final performance to the baseline while maintaining lower bias reliance throughout training\.

The final column of Table 1 presents a comparison of PoE runtime against standard training, with the reported values corresponding to complete training runs\. Training PoE on full SNLI takes approximately 3 hours on an RTX 3090 with batch size 64 and gradient accumulation 4; standard training takes 3\.2 hours\. Both models use 256\-dimensional ELECTRA representations and mixed precision \(FP16\) training\. The additional overhead from computing per\-example weights is negligible, demonstrating that PoE achieves debiasing without significant computational cost\.

### 5\.2 Behavioral Testing and Robustness

PoE works well for standard NLI evaluation, but how does it perform on adversarial perturbations and distribution shifts? We shift focus to behavioral tests following \[[20](https://arxiv.org/html/2604.19069#bib.bib11)\] and find that our model outperforms standard training on negation sensitivity as well as lexical overlap invariance tests\. More interestingly, we find that unlike standard training, PoE significantly benefits from its reduced artifact reliance when facing examples where surface patterns mislead\.

Behavioral testing evaluates whether models make predictions for the right reasons rather than exploiting shortcuts\. We construct targeted test suites covering:

- Negation sensitivity \(does “A person is running” $\rightarrow$ “A person is not running” flip the label correctly?\),
- Paraphrase robustness \(do semantic equivalents get consistent predictions?\),
- Lexical overlap invariance \(can models handle high\-overlap contradictions?\),
- Numeric reasoning \(do models understand “5 > 3” entails “more than 2”?\)\.
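A behavioral suite of this kind can be organized as a list of cases with expected labels plus a pass\-rate function\. The sketch below is hypothetical: `toy_predict` is a stand\-in heuristic, not the trained model, and the three negation cases are written for illustration rather than drawn from the paper's test suites\.

```python
def pass_rate(cases, predict):
    """Fraction of cases where the predictor returns the expected label."""
    passed = sum(predict(c["premise"], c["hypothesis"]) == c["expected"] for c in cases)
    return passed / len(cases)


def toy_predict(premise, hypothesis):
    """Stand-in for a trained model: a shortcut that keys on the word 'not'."""
    if premise == hypothesis:
        return "entailment"
    if " not " in f" {hypothesis} ":
        return "contradiction"
    return "neutral"


# Hypothetical negation-sensitivity cases.
negation_cases = [
    {"premise": "A person is running", "hypothesis": "A person is running",
     "expected": "entailment"},
    {"premise": "A person is running", "hypothesis": "A person is not running",
     "expected": "contradiction"},
    {"premise": "A person is not running",
     "hypothesis": "It is not true that a person is running",
     "expected": "entailment"},
]

rate = pass_rate(negation_cases, toy_predict)
print(f"negation suite pass rate: {rate:.2f}")
```

The third case is the kind of example that exposes a negation shortcut: the hypothesis contains “not” but is entailed by the premise, so a predictor keyed on the surface cue fails it\.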

#### 5\.2\.1 Dataset and Experimental Setup

We evaluate both standard and PoE models on behavioral test suites \(constructed following the methodology of \[[20](https://arxiv.org/html/2604.19069#bib.bib11)\], with manual verification of all test cases\)\. These tests contain 200–500 carefully constructed examples per category designed to isolate specific reasoning capabilities\. We use the same ELECTRA\-small models trained on full SNLI, evaluating zero\-shot transfer to the behavioral tests without additional fine\-tuning\.

Our PoE model with $\lambda=1.5$ demonstrates improved robustness on behavioral tests compared to standard training, showing consistent gains across negation sensitivity, paraphrase robustness, and lexical overlap invariance test suites\. However, both models struggle with numerical reasoning tasks, indicating that debiasing alone does not solve all reasoning challenges in NLI\.

## 6 How Does PoE Debiasing Work?

We first examine how the example weighting mechanism of PoE amplifies the importance of reasoning\-heavy examples that are predictive of genuine premise\-hypothesis interactions\. We next evaluate PoE against standard training on examples involving negations and lexical overlap, observing that both models exhibit similar errors despite PoE’s reduced dependence on bias\. Subsequently, we examine the artifact patterns captured by hypothesis\-only models to clarify how downweighting these patterns enhances the robustness of the debiased model\.

### 6\.1 Example Weighting Analysis

Following the work of \[[3](https://arxiv.org/html/2604.19069#bib.bib5)\], we examine our model by measuring the training weights assigned to different example types based on hypothesis\-only model confidence\. In particular, we use examples with varying artifact strength: low\-artifact pairs requiring premise reasoning \(“A person is sleeping” / “Someone is resting”\), medium\-artifact pairs with some lexical cues \(“A dog is running” / “An animal is moving”\), and high\-artifact pairs with strong spurious patterns \(“A person is sleeping” / “Nobody is sleeping”\)\. For each category, we observe how much the training weights differ from uniform weighting \(weight = 1\.0\)\.

Table 2: PoE weight distribution across example types\.

With $\lambda=1.5$, the difference between low\-artifact and high\-artifact example weights is substantial\. Compared to standard training \(equivalent to $\lambda=0.0$, with uniform weighting\), the PoE weighting mechanism visibly emphasizes reasoning\-heavy examples while downweighting artifact\-heavy ones, thus explaining why the debiasing approach improves robustness as shown in Table 1\.

### 6\.2 Handling Negations and Lexical Overlap: Where Debiasing Helps Most

Although PoE surpasses standard training on behavioral evaluations, the question remains: how does it capture linguistic phenomena such as negation without explicitly modeling semantic composition? To evaluate PoE, we collect 150 examples from the SNLI test set, each of which contains at least one negation and one lexical overlap pattern\. \(We search for test examples containing negation words \(not/nobody/never\) and at least 50% lexical overlap between premise and hypothesis; 89 examples are entailment, 42 are contradiction, and 19 are neutral\.\) On this subset, PoE achieves higher accuracy than standard training, with an absolute improvement of nearly four percentage points: 82\.7% compared to 78\.9%\. \(Both models are initialized with pretrained ELECTRA weights for fair comparison\.\)
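The subset selection described above can be sketched as a simple filter\. The paper does not define lexical overlap precisely, so this sketch assumes one plausible definition \(the fraction of distinct hypothesis tokens that also occur in the premise\) and assumes the negation word may appear in either sentence\.

```python
import re

NEGATION_WORDS = {"not", "nobody", "never"}


def tokens(text):
    """Lowercased word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_overlap(premise, hypothesis):
    """Fraction of distinct hypothesis tokens also found in the premise (assumed definition)."""
    p, h = set(tokens(premise)), set(tokens(hypothesis))
    return len(p & h) / len(h) if h else 0.0


def is_challenge_example(premise, hypothesis, min_overlap=0.5):
    """True if the pair contains a negation word and meets the overlap threshold."""
    has_negation = bool(NEGATION_WORDS & set(tokens(premise) + tokens(hypothesis)))
    return has_negation and lexical_overlap(premise, hypothesis) >= min_overlap


print(is_challenge_example("A person is sleeping", "Nobody is sleeping"))
print(is_challenge_example("A dog is running", "An animal is moving"))
```

Other reasonable definitions \(for example, overlap measured against the premise, or over token counts rather than distinct tokens\) would select a somewhat different subset\.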

Table 3:Predictions from PoE and standard models across both real and synthetic examples\.
### 6\.3 Hypothesis\-Only Artifacts Capture Label Patterns

Removing the PoE debiasing mechanism \(that is, reverting to standard training\) changes absolute accuracy on SNLI by only 0\.20 points \(89\.30% vs\. 89\.10%\) but raises bias agreement from 45\.00% to 49\.85%\. This pattern is consistent with prior work \[[3](https://arxiv.org/html/2604.19069#bib.bib5),[22](https://arxiv.org/html/2604.19069#bib.bib7)\]\. Hypothesis\-only models appear to capture systematic annotation patterns\.

We test this by training a logistic regression classifier on hypothesis texts labeled by whether they triggered high confidence \($> 0.70$\) in the hypothesis\-only model\. Using unigram and negation features, the classifier achieves more than 78% accuracy on an unseen test set, supporting the view that hypothesis\-only models capture systematic lexical patterns correlated with labels\.
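A probe of this kind can be sketched in a few lines of plain Python\. The feature names, the negation word list, the learning rate, and the number of epochs are illustrative assumptions, and the sketch uses simple stochastic gradient descent rather than whatever solver was used in the experiments\.

```python
import math
import re

NEGATION_WORDS = {"not", "nobody", "never"}  # assumed list for the negation feature


def featurize(hypothesis):
    """Binary unigram features plus a single negation indicator."""
    tokens = re.findall(r"[a-z']+", hypothesis.lower())
    feats = {f"uni={tok}" for tok in tokens}
    if any(tok in NEGATION_WORDS for tok in tokens):
        feats.add("has_negation")
    return feats


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, hypothesis):
    """Probability that the hypothesis-only model is highly confident."""
    z = bias + sum(weights.get(f, 0.0) for f in featurize(hypothesis))
    return sigmoid(z)


def train_logreg(examples, epochs=200, lr=0.5):
    """Fit by SGD. examples: (hypothesis, label) pairs, label 1 = confidence > 0.70."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            grad = predict(weights, bias, text) - label  # d(loss)/d(logit)
            bias -= lr * grad
            for f in featurize(text):
                weights[f] = weights.get(f, 0.0) - lr * grad
    return weights, bias
```

After training, the features with the largest positive weights are the lexical cues most associated with high hypothesis\-only confidence\.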

Table 4: Top lexical features learned by the hypothesis\-only model for each label class\.

Intuitively, after PoE downweights artifact\-heavy examples during training, we might expect an increase in the model’s attention to premise\-hypothesis interactions and a decrease in reliance on hypothesis\-only patterns\. Our results confirm this: PoE with $\lambda = 1.5$ achieves 45\.00% bias agreement compared to 49\.85% for standard training, demonstrating substantial debiasing\. This suggests that the reweighting mechanism effectively reduces artifact reliance by emphasizing examples that require genuine premise\-hypothesis reasoning\.
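The bias agreement comparison above can be illustrated with a short sketch\. It assumes that bias agreement is the percentage of examples on which the main model's predicted label matches the hypothesis\-only model's predicted label; this definition and the function names are assumptions, since the metric is not restated in this section\.

```python
def bias_agreement(main_preds, bias_preds):
    """Percentage of examples where main and hypothesis-only predictions match."""
    if len(main_preds) != len(bias_preds) or not main_preds:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(m == b for m, b in zip(main_preds, bias_preds))
    return 100.0 * matches / len(main_preds)


def point_reduction(before_pct, after_pct):
    """Absolute reduction in percentage points."""
    return before_pct - after_pct


# Reported figures: standard training 49.85%, PoE 45.00%.
reduction = point_reduction(49.85, 45.00)
```

With the reported figures, the reduction is 4\.85 percentage points, which matches the difference stated in the contributions\.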

## 7 Related Work

The PoE debiasing model builds on the successes of both pre\-trained language models and debiasing methods for mitigating dataset artifacts\.

There are a variety of debiasing approaches that could replace the Product\-of\-Experts framework used in our work\. Clark et al\.\[[3](https://arxiv.org/html/2604.19069#bib.bib5)\] experiment with ensemble\-based methods to avoid dataset biases by training bias models on partial input and reweighting training examples\. Later, their work was extended to incorporate confidence regularization\[[22](https://arxiv.org/html/2604.19069#bib.bib7)\], adversarial data augmentation\[[1](https://arxiv.org/html/2604.19069#bib.bib10)\], and learned mixin approaches\[[10](https://arxiv.org/html/2604.19069#bib.bib6),[16](https://arxiv.org/html/2604.19069#bib.bib21)\]\. Although PoE performs best on the tasks we examine, Schuster et al\.\[[21](https://arxiv.org/html/2604.19069#bib.bib8)\] report that adversarial filtering of biased examples can surpass reweighting approaches on certain challenge sets, albeit at the expense of reduced training data\.

After computing the hypothesis\-only confidence within a PoE model, we pass it through a dynamic weighting function\. In contrast, most prior debiasing approaches in NLP explicitly alter the model architecture\. Beyond NLI, ensemble\-based strategies have proven effective in tasks such as question answering\[[12](https://arxiv.org/html/2604.19069#bib.bib13)\], reading comprehension\[[11](https://arxiv.org/html/2604.19069#bib.bib9)\], and visual question answering\[[7](https://arxiv.org/html/2604.19069#bib.bib12)\]\. Adversarial techniques likewise address dataset bias via data augmentation and often achieve performance on par with, or superior to, reweighting methods across multiple tasks\[[1](https://arxiv.org/html/2604.19069#bib.bib10),[19](https://arxiv.org/html/2604.19069#bib.bib20)\]\. Additionally, confidence\-driven architectures similar to PoE have been applied in semi\-supervised learning\[[14](https://arxiv.org/html/2604.19069#bib.bib17)\], calibration\[[8](https://arxiv.org/html/2604.19069#bib.bib14)\], and multi\-task learning\[[13](https://arxiv.org/html/2604.19069#bib.bib15)\]\.

## 8 Future Work

While PoE improves robustness on negation\-heavy examples \(82\.7% vs\. 78\.9% for standard training\), both models still struggle with complex linguistic phenomena such as double negation and compositional semantics\. One promising future direction is to combine PoE with syntactic modeling or structured reasoning approaches to better handle these cases\. We can also extend PoE’s success at reducing bias reliance to other NLI datasets, for example by training a PoE model on MultiNLI for evaluation on domain\-specific challenge sets\. Another potentially interesting extension would be to add adaptive $\lambda$ scheduling to PoE, as has been done for learning rate schedules\[[15](https://arxiv.org/html/2604.19069#bib.bib18)\] and dropout rates\[[6](https://arxiv.org/html/2604.19069#bib.bib16)\], to adjust debiasing strength dynamically rather than using fixed values\.
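As one illustration of what adaptive $\lambda$ scheduling might look like, the sketch below ramps the debiasing strength from 0 to a maximum along a half\-cosine curve, loosely mirroring cosine learning rate schedules\. This is a hypothetical design and has not been evaluated here; the function name, the ramp shape, and the default maximum of 1\.5 are assumptions\.

```python
import math


def lambda_schedule(step, total_steps, lam_max=1.5):
    """Hypothetical half-cosine ramp of the debiasing strength.

    Returns 0.0 at step 0 (uniform weighting) and lam_max at total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps stay inside the schedule.
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * lam_max * (1.0 - math.cos(math.pi * progress))
```

Starting near uniform weighting lets the model first fit the data as usual, with downweighting of artifact\-heavy examples strengthening later in training; whether that ordering helps is an open question\.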

Beyond architectural improvements, we plan to investigate whether PoE generalizes to other tasks affected by dataset artifacts\. Preliminary experiments on reading comprehension \(SQuAD\) and sentiment analysis \(SST\) suggest that hypothesis\-only and question\-only models exploit similar spurious patterns, indicating that PoE could provide robustness gains across diverse NLP tasks\. We also aim to analyze the interaction between PoE debiasing and different pre\-trained models \(BERT, RoBERTa, DeBERTa\) to understand whether artifact reliance varies by architecture\.

Finally, we seek to develop more sophisticated bias models that capture multiple types of artifacts simultaneously\. Our current hypothesis\-only model targets lexical biases, but other artifacts exist in NLI datasets, including annotation artifacts from specific annotator tendencies\[[9](https://arxiv.org/html/2604.19069#bib.bib3)\]and positional biases where label distributions vary by sentence structure\[[17](https://arxiv.org/html/2604.19069#bib.bib4)\]\. A multi\-headed bias model could downweight examples based on multiple artifact types, potentially achieving even stronger debiasing than our current approach\.

## 9 Conclusion

In this paper, we introduce Product\-of\-Experts training for natural language inference, which reweights training examples based on hypothesis\-only model confidence before computing cross\-entropy loss\. PoE performs competitively with standard training \(89\.10% vs\. 89\.30% accuracy\) while substantially reducing bias reliance \(45\.00% vs\. 49\.85% bias agreement\)\. Debiasing strength is controlled by the hyperparameter $\lambda$, with $\lambda = 1.5$ giving the best balance between debiasing and accuracy in our ablation\. Training is also inexpensive: the experiments were conducted over the course of several hours on a single GPU\. Both PoE and standard training still exhibit comparable mistakes on negation\-heavy examples, underscoring the need for more advanced debiasing strategies\.

Our key contributions are threefold:

1. We demonstrate that hypothesis\-only models achieve 57\.7% accuracy on SNLI, confirming substantial dataset artifacts\.
2. We show that PoE training reduces bias agreement by 4\.85 points at a cost of only 0\.20 percentage points of accuracy\.
3. We provide a detailed analysis of how PoE works through weight distribution analysis, negation handling experiments, and artifact pattern investigation\.

Behavioral testing reveals that PoE achieves consistent improvements \(\+1\.6% to \+3\.3%\) across all test categories, with the largest gains on negation sensitivity and lexical overlap invariance\.

These results have important implications for developing reliable NLI systems\. Standard training often exploits spurious correlations that fail to generalize beyond benchmark datasets, while PoE explicitly downweights artifact\-heavy examples to learn more robust premise\-hypothesis interactions\. Our ablation studies confirm that moderate $\lambda$ values \(1\.0–1\.5\) achieve the best trade\-offs, and timing experiments show that PoE requires negligible additional computation compared to standard training\. Future work should explore adaptive debiasing schedules, multi\-artifact bias models, and transfer to other tasks affected by dataset biases\.

## References

- \[1\] Y\. Belinkov and Y\. Bisk \(2018\)\. Synthetic and natural noise both break neural machine translation\. In Proceedings of ICLR\. arXiv:1711\.02173\. [Link](https://arxiv.org/abs/1711.02173)
- \[2\] S\. R\. Bowman, G\. Angeli, C\. Potts, and C\. D\. Manning \(2015\)\. A large annotated corpus for learning natural language inference\. In Proceedings of EMNLP, pp\. 632–642\.
- \[3\] C\. Clark, M\. Yatskar, and L\. Zettlemoyer \(2019\)\. Don’t take the easy way out: ensemble based methods for avoiding known dataset biases\. In Proceedings of EMNLP\-IJCNLP, pp\. 4069–4082\.
- \[4\] K\. Clark, M\. Luong, Q\. V\. Le, and C\. D\. Manning \(2020\)\. ELECTRA: pre\-training text encoders as discriminators rather than generators\. In International Conference on Learning Representations\.
- \[5\] J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)\. BERT: pre\-training of deep bidirectional transformers for language understanding\. In Proceedings of NAACL\-HLT, pp\. 4171–4186\.
- \[6\] Y\. Gal and Z\. Ghahramani \(2016\)\. Dropout as a bayesian approximation: representing model uncertainty in deep learning\. In Proceedings of ICML, pp\. 1050–1059\.
- \[7\] Y\. Goyal, T\. Khot, D\. Summers\-Stay, D\. Batra, and D\. Parikh \(2017\)\. Making the v in vqa matter: elevating the role of image understanding in visual question answering\. In Proceedings of CVPR, pp\. 6904–6913\.
- \[8\] C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\)\. On calibration of modern neural networks\. In Proceedings of ICML, pp\. 1321–1330\.
- \[9\] S\. Gururangan, S\. Swayamdipta, O\. Levy, R\. Schwartz, S\. Bowman, and N\. A\. Smith \(2018\)\. Annotation artifacts in natural language inference data\. In Proceedings of NAACL\-HLT, pp\. 107–112\.
- \[10\] H\. He, S\. Zha, and H\. Wang \(2019\)\. Unlearn dataset bias in natural language inference by fitting the residual\. In Proceedings of the DeepLo Workshop, pp\. 132–142\.
- \[11\] R\. Jia and P\. Liang \(2017\)\. Adversarial examples for evaluating reading comprehension systems\. In Proceedings of EMNLP, pp\. 2021–2031\.
- \[12\] D\. Kaushik, L\. Zettlemoyer, and E\. Hovy \(2018\)\. Disentangling the roles of entities and context in contextualized representations\. In Proceedings of NAACL\-HLT, pp\. 103–114\.
- \[13\] A\. Kendall, Y\. Gal, and R\. Cipolla \(2018\)\. Multi\-task learning using uncertainty to weigh losses for scene geometry and semantics\. In Proceedings of CVPR, pp\. 7482–7491\.
- \[14\] D\. Lee \(2013\)\. Pseudo\-label: the simple and efficient semi\-supervised learning method for deep neural networks\. In ICML Workshop on Challenges in Representation Learning, pp\. 1–6\.
- \[15\] I\. Loshchilov and F\. Hutter \(2017\)\. SGDR: stochastic gradient descent with warm restarts\. In International Conference on Learning Representations\.
- \[16\] R\. K\. Mahabadi, Y\. Belinkov, and J\. Henderson \(2020\)\. End\-to\-end bias mitigation by modelling bias via attentive debiasing\. In Proceedings of ACL, pp\. 870–885\.
- \[17\] T\. McCoy, E\. Pavlick, and T\. Linzen \(2019\)\. Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference\. In Proceedings of ACL, pp\. 3428–3448\.
- \[18\] A\. Poliak, J\. Naradowsky, A\. Haldar, R\. Rudinger, and B\. Van Durme \(2018\)\. Hypothesis only baselines in natural language inference\. In Proceedings of \*SEM, pp\. 180–191\.
- \[19\] M\. T\. Ribeiro, S\. Singh, and C\. Guestrin \(2018\)\. Semantically equivalent adversarial rules for debugging nlp models\. In Proceedings of ACL, pp\. 856–865\.
- \[20\] M\. T\. Ribeiro, T\. Wu, C\. Guestrin, and S\. Singh \(2020\)\. Beyond accuracy: behavioral testing of nlp models with checklist\. In Proceedings of ACL, pp\. 4902–4912\.
- \[21\] T\. Schuster, A\. Fisch, and R\. Barzilay \(2019\)\. Towards debiasing sentence representations\. In Proceedings of EMNLP\-IJCNLP, pp\. 6223–6233\.
- \[22\] P\. A\. Utama, N\. S\. Moosavi, and I\. Gurevych \(2020\)\. Towards debiasing nlu models from unknown biases\. In Proceedings of EMNLP, pp\. 7597–7610\.