No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation

arXiv cs.CL Papers

Summary

A research paper proposing NWCAD (No-Worse Context-Aware Decoding), a decode-time adapter that prevents 'neutral regression' where LLMs overwrite already-correct answers when given non-informative context, using a two-stream architecture with gated fallback to no-context decoding.

arXiv:2604.16686v1 Announce Type: new Abstract: Large language models (LLMs) can answer questions and summarize documents when conditioned on external contexts (e.g., retrieved evidence), yet context use remains unreliable: models may overwrite an already-correct output (neutral regression) even when the context is non-informative. We formalize neutral regression as a do-no-harm requirement and quantify it by measuring accuracy drops on baseline-correct items under answer-consistent contexts. We propose No-Worse Context-Aware Decoding (NWCAD), a decode-time adapter built on a two-stream setup with a two-stage gate: it backs off to no-context decoding when the context is non-informative, and otherwise uses context-conditioned decoding with a CAD-style fallback under uncertainty. We evaluate NWCAD on benchmarks that separate do-no-harm reliability from context utilization (accuracy gains on genuinely helpful contexts). NWCAD prevents neutral regression on baseline-correct items while preserving strong context-driven accuracy on helpful contexts.
Original Article
View Cached Full Text

Cached at: 04/21/26, 07:03 AM

# No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation
Source: [https://arxiv.org/html/2604.16686](https://arxiv.org/html/2604.16686)
Yufei Tao Affiliation / Address line 1 Affiliation / Address line 2 Affiliation / Address line 3 email@domain &Ameeta Agrawal Affiliation / Address line 1 Affiliation / Address line 2 Affiliation / Address line 3 email@domain Yufei Tao Ameeta Agrawal Department of Computer Science, Portland State University, USA \{yutao, ameeta\}@pdx\.edu

###### Abstract

Large language models \(LLMs\) can answer questions and summarize documents when conditioned on external contexts \(e\.g\., retrieved evidence\), yet context use remains unreliable: models may overwrite an already\-correct output \(*neutral regression*\) even when the context is non\-informative\. We formalize neutral regression as a do\-no\-harm requirement and quantify it by measuring accuracy drops on baseline\-correct items under answer\-consistent contexts\. We propose No\-Worse Context\-Aware Decoding \(NWCAD\), a decode\-time adapter built on a two\-stream setup with a two\-stage gate: it backs off to no\-context decoding when the context is non\-informative, and otherwise uses context\-conditioned decoding with a CAD\-style fallback under uncertainty\. We evaluate NWCAD on benchmarks that separate do\-no\-harm reliability from*context utilization*\(accuracy gains on genuinely helpful contexts\)\. NWCAD prevents neutral regression on baseline\-correct items while preserving strong context\-driven accuracy on helpful contexts\.

No\-Worse Context\-Aware Decoding: Preventing Neutral Regression in Context\-Conditioned Generation

Yufei Tao Ameeta AgrawalDepartment of Computer Science, Portland State University, USA\{yutao, ameeta\}@pdx\.edu

## 1Introduction

Large language models \(LLMs\) can answer questions and generate summaries when given external contexts, but context\-conditioned generation is not automatically reliable\. In many context\-conditioned workflows like retrieval\-augmented generation \(RAG\)Lewis et al\. \([2020](https://arxiv.org/html/2604.16686#bib.bib8)\); Izacard and Grave \([2021](https://arxiv.org/html/2604.16686#bib.bib3)\)and user\-provided context, the provided passage is often*partially*relevant: it may mention related entities, type\-matched facts, or near\-miss numbers, even when it does not actually entail the correct answer\. In such situations, conditioning on context can subtly shift the model’s next\-token distribution and cause an avoidable change in the final output\.

We call this failure mode*neutral regression*: the model overwrites an already\-correct answer even though the context is effectively non\-informative\. Neutral regression is easy to miss in standard evaluations under aggregate accuracy, motivating evaluations that separate neutral and helpful\-context cases\. Symmetrically, we also care about context utilization: when the context is genuinely helpful, a decoder should be able to use it rather than over\-trusting the model’s parametric answer\.

![Refer to caption](https://arxiv.org/html/2604.16686v1/figures/pipeline.png)Figure 1:Overview of NWCAD showing two stream inputs when the no\-context stream is confident, it keeps the no\-context decision \(preventing neutral regression\); when the with\-context stream is confident, it uses with\-context decoding; otherwise it invokes a CAD\-style fallback decoder\.A common remedy is context\-aware decoding: contrast the model’s distribution with and without the context, and*tilt*generation toward tokens boosted by context \(e\.g\., CAD and adaptive variants such as AdaCAD and CoCoAShi et al\. \([2023](https://arxiv.org/html/2604.16686#bib.bib18)\); Wang et al\. \([2025a](https://arxiv.org/html/2604.16686#bib.bib23)\); Khandelwal et al\. \([2025](https://arxiv.org/html/2604.16686#bib.bib4)\)\)\. While effective on average in conflict cases, these decoders lack a do\-no\-harm guarantee: even when the base model is correct and the context provides weak or noisy evidence, a small distributional shift can change a token choice and cascade into a different \(and sometimes wrong\) answer\.

Because standard benchmarks mix neutral and conflict contexts, neutral regressions can be obscured by averaged accuracy; we therefore evaluate separately on \(i\) questions that are already answered correctly without context \(do\-no\-harm\) and \(ii\) questions where the context can correct the answer \(*context utilization*\)\. This separation also makes the core trade\-off explicit: when the context is non\-informative, a decoder should preserve the no\-context output; when it is informative, it should shift toward context to correct the answer\. A practical decoder should therefore behave like an explicit adapter with an exact no\-context backoff, rather than always applying a logit tilt\. This exact backoff capability is a qualitative difference from continuous logit\-tilting decoders: if a method always perturbs logits, it cannot guarantee reproducing the no\-context output on neutral inputs\.

To address this do\-no\-harm vs\. context utilization tension, we proposeNo\-Worse Context\-Aware Decoding \(NWCAD\), a decode\-time adapter built on a two\-stream setup with a two\-stage gate that yields a three\-way routing decision \(Figure[1](https://arxiv.org/html/2604.16686#S1.F1)presents an overview\): it first decides when to ignore the context and keep the no\-context decoding \(preventing neutral regression\), and when it does use the context, it routes between standard context\-conditioned decoding and a contrastive decoder when the context signal is uncertain\. Our main contributions are as follows:

- •systematically characterizing neutral regression and building controlled evaluations that separate do\-no\-harm reliability from context utilization;
- •proposing NWCAD111[https://github\.com/CastGryff/NWCAD](https://github.com/CastGryff/NWCAD), a two\-stage gate that provably preserves the no\-context output whenever it selects the backoff branch \(i\.e\., setsz′⁣t=z0tz^\{\\prime t\}=z\_\{0\}^\{t\}\) under greedy decoding;
- •evaluating NWCAD across multiple models and a diverse QA benchmark suite to test generalizability; and
- •showing empirically that most context\-aware decoding largely reduces to*regime selection*between no\-context and with\-context decoding, with contrastive mixing rarely needed\.

## 2Related Work

#### Retrieval\-augmented generation\.

Retrieval\-augmented generation \(RAG\) couples context retrieval with conditional generation to improve factuality and access up\-to\-date informationLewis et al\. \([2020](https://arxiv.org/html/2604.16686#bib.bib8)\); Izacard and Grave \([2021](https://arxiv.org/html/2604.16686#bib.bib3)\)\. Recent work highlights that RAG systems can be brittle to distractor evidence and context dilution, motivating context selection and fixed\-budget evidence assemblyLi and Ouyang \([2024](https://arxiv.org/html/2604.16686#bib.bib9)\); Lahmy and Yozevitch \([2025](https://arxiv.org/html/2604.16686#bib.bib6)\); Iratni et al\. \([2025](https://arxiv.org/html/2604.16686#bib.bib2)\)\.

#### Contextual grounding and verification\.

A broad line of work studies how to make generation faithful to provided evidence, including fine\-grained evaluation of factuality/faithfulness and lightweight verification against grounding documentsMaynez et al\. \([2020](https://arxiv.org/html/2604.16686#bib.bib13)\); Kryscinski et al\. \([2020](https://arxiv.org/html/2604.16686#bib.bib5)\); Min et al\. \([2023](https://arxiv.org/html/2604.16686#bib.bib14)\); Zhang et al\. \([2024](https://arxiv.org/html/2604.16686#bib.bib26)\); Tang et al\. \([2024a](https://arxiv.org/html/2604.16686#bib.bib19)\); Tao et al\. \([2025](https://arxiv.org/html/2604.16686#bib.bib21)\)\. Recent RAG\-centric frameworks further emphasize context\-faithful behavior under real retrieval noiseNguyen et al\. \([2024](https://arxiv.org/html/2604.16686#bib.bib16)\)\.

#### Selective answering and risk control\.

Selective prediction/abstention trades off risk against coverage by refusing to answer \(or filtering generations\) when uncertainty is highTomani et al\. \([2024](https://arxiv.org/html/2604.16686#bib.bib22)\); Nie et al\. \([2024](https://arxiv.org/html/2604.16686#bib.bib17)\); Wang et al\. \([2025b](https://arxiv.org/html/2604.16686#bib.bib24)\)\. These methods typically operate at the response level \(e\.g\., abstain or sample\-then\-filter\) and provide complementary mechanisms for controlling errors under uncertainty\.

#### Context\-aware decoding\.

A separate line of work modifies decoding directly by contrasting with\-context and no\-context distributions\. Context\-Aware Decoding \(CAD\)Shi et al\. \([2023](https://arxiv.org/html/2604.16686#bib.bib18)\)biases toward tokens that become more likely when context is present\. Adaptive CAD \(AdaCAD\)Wang et al\. \([2025a](https://arxiv.org/html/2604.16686#bib.bib23)\)varies the tilt strength with the divergence between the two distributions, reducing but not eliminating over\-correction\. CoCoAKhandelwal et al\. \([2025](https://arxiv.org/html/2604.16686#bib.bib4)\)adds confidence signals \(e\.g\., entropy and margin/peakedness\) to modulate tilt dynamically\. These methods improve average factuality, yet they share two limitations\. First, they offer*no formal guarantee of non\-regression*: even on neutral inputs where the base model was already correct and the context adds no new information, they can still alter the output\. Second, their continuous logit reweighting can still flip token choices in low\-conflict settings, even when context provides little new information\.

NWCAD is a decode\-time adapter for context\-conditioned generation that targets neutral regression\. Relative to response\-level abstention or verifier\-based approaches, it requires no additional models and operates directly on the token distribution\. Relative to CAD/AdaCAD/CoCoA\-style two\-stream context\-aware decoding, it adds an explicit backoff to no\-context decoding on low\-divergence steps, yielding a per\-step no\-neutral\-regression under greedy decoding and making the do\-no\-harm vs\. context utilization trade\-off tunable via simple thresholds\.

## 3Methodology

In this section we formalize our problem setting and introduce No\-Worse Context\-Aware Decoding \(NWCAD\)\.

### 3\.1Setup and Definitions

We assume a left\-to\-right autoregressive language model with vocabulary𝒱\\mathcal\{V\}and consider generation given a fixed prompt \(e\.g\., a question\) and an optional context \(e\.g\., a retrieved passage\)\. At decoding steptt, both the with\-context and no\-context streams share the same generated prefix, but differ in whether the external context is included\.

Letzct∈ℝ\|𝒱\|z\_\{c\}^\{t\}\\in\\mathbb\{R\}^\{\|\\mathcal\{V\}\|\}be the logits obtained by conditioning on the*contextual input*\(context \+ prompt \+ prefix\), and letz0tz\_\{0\}^\{t\}be the logits obtained by conditioning on the*context\-free input*\(prompt \+ prefix only\)\. We denote the corresponding next\-token distributions by

pct=softmax​\(zct\),p0t=softmax​\(z0t\)\.p\_\{c\}^\{t\}=\\text\{softmax\}\(z\_\{c\}^\{t\}\),\\quad p\_\{0\}^\{t\}=\\text\{softmax\}\(z\_\{0\}^\{t\}\)\.
We use two lightweight signals to decide when to trust each stream: \(i\) a*context pressure*score based on divergence, and \(ii\) a simple*confidence*score based on the top\-1 margin\.

For divergence, letDtD^\{t\}denote the Jensen\-Shannon divergence between the two token distributions at steptt,Dt=JS​\(pct∥p0t\)D^\{t\}=\\text\{JS\}\(p\_\{c\}^\{t\}\\,\\\|\\,p\_\{0\}^\{t\}\)\. Since computing JS over the full vocabulary is expensive, we approximate it over the union of the top\-KKtokens from both streams \(we useK=50K\{=\}50\)\. This top\-KKapproximation closely matches full\-vocab JS and does not change neutrality/backoff decisions on a representative QA sample \(See Appendix[A](https://arxiv.org/html/2604.16686#A1)\)\. We use low divergence as a proxy for “neutral” steps \(context not materially changing the next\-token decision\) and high divergence as “conflict” steps \(context exerting pressure\):

Neutral​\(t\)=\{Dt≤τ\}\.\\text\{Neutral\}\(t\)\\;=\\;\\\{\\,D^\{t\}\\leq\\tau\\,\\\}\.
DtD^\{t\}measures how much the context changes the model’s next\-token distribution\. WhenDtD^\{t\}is small, the two streams agree; this is consistent with a non\-informative context, but it is not sufficient on its own \(e\.g\., both streams may be uncertain yet similar\)\. We therefore combine divergence with a confidence margin before treating a step as safe to back off\.

For confidence, we use the top\-1 margin of a distribution, i\.e\., the probability gap between its most likely and second\-most likely tokens\. We denote this margin byp1−p2p\_\{1\}\-p\_\{2\}and use it as a proxy for “how decisive” a stream is at a given step\. Margins let us distinguish agreement with high confidence \(safe backoff\) from agreement under uncertainty \(route to the context side\)\.

### 3\.2From Context\-Aware Decoding to No\-Worse Decoding

Context\-aware decoding methods such as CAD, AdaCAD, and CoCoA combine a context\-free stream and a with\-context stream by*tilting*toward tokens that become more likely when context is present\. A representative form is

zCADt=zct\+α​\(zct−z0t\)=\(1\+α\)​zct−α​z0t,z\_\{\\text\{CAD\}\}^\{t\}\\;=\\;z\_\{c\}^\{t\}\\;\+\\;\\alpha\\big\(z\_\{c\}^\{t\}\-z\_\{0\}^\{t\}\\big\)\\;=\\;\(1\+\\alpha\)z\_\{c\}^\{t\}\-\\alpha z\_\{0\}^\{t\},with a fixedα\\alpha\(CAD\) or an adaptive scheduleαt\\alpha^\{t\}based on divergence or confidence signals \(AdaCAD, CoCoA\)\. These methods can improve conflict cases on average, but they introduce a key failure mode:*neutral regression*\. Even when the context adds no useful information, the tilt is still applied, so small distributional differences can change a token choice and cascade into a different sequence\.

In contrast, NWCAD is a adapter rather than a new tilting rule: it can exactly back off to the no\-context stream on neutral steps and otherwise routes to either standard with\-context decoding or a CAD\-style fallback decoder\. NWCAD addresses neutral regression by backing off to no\-context decoding when the context appears non\-informative, while still allowing context\-driven corrections when the context is informative\.

The neutrality thresholdτ\\taucontrols this trade\-off: largerτ\\taumakes the decoder more conservative \(more backoff to the no\-context stream\), while smallerτ\\tauincreases context utilization\. Unlike continuous tilting \(which always perturbs logits\), an explicit gate can copyz0tz\_\{0\}^\{t\}exactly; under greedy decoding, this prevents “small” context pressure from flipping an argmax token and cascading into a different answer\.

### 3\.3No\-Worse Context\-Aware Decoding \(NWCAD\)

NWCAD maintains two parallel forward passes \(with context vs\. without\) and uses a two\-stage gate to choose which logits to decode from at each step\.

#### Stage 1 \(BC gate; baseline\-correct\): exact backoff to the no\-context stream\.

When the two distributions agree \(Dt≤τD^\{t\}\\leq\\tau\) and the no\-context stream is decisive \(p0,1t−p0,2t≥κprip\_\{0,1\}^\{t\}\-p\_\{0,2\}^\{t\}\\geq\\kappa\_\{\\text\{pri\}\}\), we copy logits from the no\-context stream:z′⁣t=z0tz^\{\\prime t\}=z\_\{0\}^\{t\}\. Under greedy decoding, whenever Stage 1 applies and we setz′⁣t=z0tz^\{\\prime t\}=z\_\{0\}^\{t\}, the next token is identical to the no\-context stream at steptt\(immediate sincearg⁡maxi⁡zi′⁣t=arg⁡maxi⁡z0,it\\arg\\max\_\{i\}z^\{\\prime t\}\_\{i\}=\\arg\\max\_\{i\}z\_\{0,i\}^\{t\}\)\. Consequently, if Stage 1 applies at every step, the full decoded sequence matches the no\-context output exactly\.

#### Stage 2 \(CC gate; context\-confident\): use context when decisive; fallback otherwise\.

If Stage 1 does not apply, we route to the context side\. If the with\-context stream is decisive \(pc,1t−pc,2t≥κctxp\_\{c,1\}^\{t\}\-p\_\{c,2\}^\{t\}\\geq\\kappa\_\{\\text\{ctx\}\}\), we decode fromzctz\_\{c\}^\{t\}; otherwise we decode from a plug\-in CAD\-style fallback decoderzfallbacktz\_\{\\text\{fallback\}\}^\{t\}\.

#### Decision flow\.

At each token, NWCAD either \(i\) preserves the no\-context decision \(Stage 1\), or \(ii\) uses the context stream when it is confident, and only falls back to a stronger contrastive decoder on a small set of uncertain steps \(Stage 2\)\.

The per\-step logic can be summarized as:

z′⁣t=\{z0t,Dt≤τ∧\(p0,1t−p0,2t\)≥κprizct,\(pc,1t−pc,2t\)≥κctxzfallbackt,CAD\-style fallbackz^\{\\prime t\}\\;=\\;\\begin\{cases\}z\_\{0\}^\{t\},&D^\{t\}\\leq\\tau\\ \\land\\ \(p\_\{0,1\}^\{t\}\-p\_\{0,2\}^\{t\}\)\\geq\\kappa\_\{\\text\{pri\}\}\\\\ z\_\{c\}^\{t\},&\(p\_\{c,1\}^\{t\}\-p\_\{c,2\}^\{t\}\)\\geq\\kappa\_\{\\text\{ctx\}\}\\\\ z\_\{\\text\{fallback\}\}^\{t\},&\\text\{CAD\-style fallback\}\\end\{cases\}
Stage 2 starts only when Stage 1 does not select the no\-context stream\. At that point, NWCAD either uses standard with\-context decoding if the with\-context stream is confident, or it uses the CAD\-style fallback otherwise\. In the ablations below, we therefore distinguish NWCADBC\{\}\_\{\\text\{BC\}\}\(Stage 1 only\) which is a no\-fallback variant and the full two\-stage NWCAD \(Stage 1 BC gate \+ Stage 2 CC gate\) with a plug\-in fallback decoder; unless stated otherwise, the fallback is CoCoA\. We default to CoCoA because it reports strong conflict\-setting performance in its original study; however, the fallback is plug\-in and can be replaced \(e\.g\., CAD/AdaCAD\), which we report as ablations\. We use NWCADBC\{\}\_\{\\text\{BC\}\}for the Stage 1\-only ablation \(BC gate only\), and NWCADXwhen instantiating the fallback decoder as a specific two\-stream decoderXX\(e\.g\., CAD/AdaCAD/CoCoA\)\.

QuestionWhen were the Articles of Confederation put into effect?Gold answerMarch 1, 1781Restated contextThe Articles of Confederation officially went into effect on March 1, 1781, marking the first governing framework for the newly independent states\.Distractor contextThe Articles of Confederation were officially adopted after several years of debate among the thirteen states\. Although they were drafted in 1777, it took until 1780 for all states to ratify the articles and put the new confederation government into operation\.Table 1:Example restated and distractor contexts from augmented NQ\-open\. Distractors add type\-matched pressure without entailing the gold answer\.

## 4Part I: Controlled Evaluation on Augmented NQ\-open

Part I uses controlled datasets to diagnose neutral regression and tune a small set of gating thresholds on this diagnostic benchmark\. We then freeze these thresholds and evaluate on full\-slice QA and beyond\-QA tasks in Part II\.

### 4\.1Augmented NQ\-open Benchmark

We start from NQ\-openLee et al\. \([2019](https://arxiv.org/html/2604.16686#bib.bib7)\), an open\-domain QA dataset with short answers grounded in Wikipedia\. For each question, we attach two types of additional context: \(1\)restatedcontexts, which restate the gold fact and are answer\-consistent, and \(2\)distractorcontexts, which introduce type\-matched but incorrect information \(e\.g\., a nearby year\), creating plausible but misleading context\. These slices are designed to test whether models avoid regressions when context is not helpful\. Table[1](https://arxiv.org/html/2604.16686#S3.T1)provides an example\. We also construct ahelpfulslice by selecting questions that the base model answers incorrectly and adding restated contexts intended to help\. This results in 651 restated, 763 distractor, and 637 helpful examples in the full pool\.

We automatically verify that restated and helpful contexts contain the gold short answer span \(100% of released examples\), and that distractor contexts exclude the gold answer after SQuAD normalization \(99\.3%\) while still including the intended type\-matched distractor \(98\.5%\)\.

Why use controlled subsets? On the full dataset, changes in a model’s answer are hard to interpret: the baseline may already be wrong, or the context may be incorrect or only weakly relevant, so it is unclear whether a change is an improvement or a regression\. To avoid this ambiguity, we evaluate on controlled subsets where the desired behavior is clear\. For the restated and distractor settings, we use only*baseline\-correct*examples, where the no\-context baseline matches the gold answer with high confidence \(min\_prefix\_margin≥0\.8\\geq 0\.8over the first 8 tokens\)\. In this case, the correct behavior is to preserve the baseline answer\. For the helpful setting, we use examples where the baseline is wrong but standard decoding with context is correct, so improvements directly reflect better use of context\. In total, we evaluate on 900 examples, with 300 in each subset\.

### 4\.2Models and decoding setup

We evaluate on three open\-weight instruction\-tuned models: Llama\-3\.1\-8B\-Instruct and Llama\-3\.1\-70B\-InstructAI@Meta \([2024](https://arxiv.org/html/2604.16686#bib.bib1)\), and Ministral\-3\-8B\-Instruct\-2512Mistral AI \([2025](https://arxiv.org/html/2604.16686#bib.bib15)\)\. We use greedy decoding with max\_new\_tokens=32 for all methods to isolate decode\-time effects\. For two\-stream methods, we run both a with\-context and no\-context forward pass; NWCAD uses a top\-KKunion approximation \(K=50K\\\!=\\\!50\) for JS divergence and uses CoCoA as the fallback decoder in our main results \(i\.e\., NWCADCoCoA\{\}\_\{\\text\{CoCoA\}\}\)\. We tune the thresholds on Llama\-3\.1\-8B controlled slices and then*reuse*the same settings for Ministral\-3\-8B and Llama\-3\.1\-70B \(more details on the tuning experiments are included in the Appendix Table[B\.1](https://arxiv.org/html/2604.16686#A2.T1)\.

To test whether neutral regression generalizes beyond open\-weight models, we additionally evaluate No\-context vs\. With\-context on two API models \(gpt\-5\-mini\-2025\-08\-07andgpt\-5\.2\-2025\-12\-11\) under deterministic decoding; since token\-level logits are not exposed, we do not run NWCAD/CAD\-style decoders for these API models\.

![Refer to caption](https://arxiv.org/html/2604.16686v1/figures/context_effect_delta_bars_abs.png)Figure 2:QA accuracy \(no\-context and with\-context\) across models and three controlled slices\. Averaged across models, adding context decreases accuracy on neutral \(Restated/Distractor\) but increases accuracy on helpful examples where the baseline is wrong\.
### 4\.3Baseline methods

We compare decoding without context \(no\-context\), standard with\-context decoding \(with\-context\), and context\-aware decoders \(CAD, AdaCAD, CoCoA\), alongside NWCAD\.

### 4\.4Metrics

On QA we report exact\-match accuracy \(SQuAD\-normalized\), which evaluates short\-answer matches after standard normalization \(lowercasing and removing punctuation and articles\)\. We compute EM on a short answer extracted from the model output; Appendix[C](https://arxiv.org/html/2604.16686#A3)details the prompting and extraction protocol\. For controlled\-slice summaries, we use a micro\-averaged \(count\-weighted\) average across slices\.

![Refer to caption](https://arxiv.org/html/2604.16686v1/figures/nwr_bubble_range.png)Figure 3:Controlled QA tradeoff between neutral preservation and context utilization \(accuracy; %\)\. Each panel plots Helpful accuracy \(x\-axis\) against neutrality on BC subsets \(y\-axis; marker at mean of Restate/Distractor, vertical bar spans the two\)\. NWCAD moves up and right across all models, improving both neutrality under distractors and helpful gains on helpful contexts\.Table 2:Qualitative examples illustrating neutral regression and context correction\. NWCAD preserves baseline\-correct answers on neutral contexts and uses context when it is helpful\.
### 4\.5Results and Discussion

We organize results around the do\-no\-harm vs\. context utilization tension: preserve baseline\-correct answers when the context is non\-informative, while leveraging context when it is helpful\.

#### Neutral regression across model families\.

We first validate that neutral regression is not specific to any single model family or to CAD\-style contrastive decoding\. Figure[2](https://arxiv.org/html/2604.16686#S4.F2)compares the no\-context baseline against standard with\-context decoding on baseline\-correct \(BC\) neutral subsets: since BC filtering ensures the baseline matches the gold short answer, baseline achieves 100% accuracy by construction; any drop under with\-context decoding reflects an avoidable regression induced by a not\-helpful \(answer\-consistent\) context\. Across three open\-source models and two API models, with\-context decoding exhibits consistent regressions especially under distractor\-hard contexts \(middle bar\), while simultaneously improving accuracy on helpful contexts \(rightmost bar\)\. This empirical trade\-off motivates NWCAD’s decode\-time gating\.

#### Separating do\-no\-harm from context utilization\.

Figure[3](https://arxiv.org/html/2604.16686#S4.F3)reports accuracy on controlled QA slices that isolate the two objectives\. Across all evaluated open\-weight models, NWCAD achieves the best weighted average by improving both neutral preservation and context utilization\. In particular, continuous tilting methods appear to over\-react to distractor contexts \(large drops on distractor\-hard BC\), whereas NWCAD’s explicit backoff prevents these avoidable regressions while still allowing corrections on the helpful subset\.

![Refer to caption](https://arxiv.org/html/2604.16686v1/figures/dumbbell_by_dataset.png)Figure 4:Full\-slice results across QA and beyond\-QA benchmarks \(higher is better\)\. Restate/Distractor/Helpful through TabMWP report EM; ToFuEval uses AlignScore; ExpertQA reports ROUGE\-L and BERTScore\-P\. NQ\-SYNTH/NQ\-SWAP are context\-defined, so near\-zero Baseline accuracy is expected\. For each task, the vertical bar shows the Baseline→\\rightarrowNWCAD score lift, with other methods plotted for reference\.

### 4\.6Qualitative Analysis

Table[2](https://arxiv.org/html/2604.16686#S4.T2)illustrates some representative cases from our augmented NQ\-open benchmark\. The first two examples show*neutral regression*: a context contains type\-matched but non\-decisive information \(e\.g\., a related year\), and CAD\-style tilting drifts away from a correct base answer\. The final example shows the*helpful\-context*setting, where the context is genuinely informative and NWCAD allows a correction\.

## 5Part II: Full\-slice Evaluation

Part II evaluates on full\-slice benchmarks that mix neutral/conflict and retrieval noise\. We use the same models, decoding setup, and metrics as in Part I unless otherwise noted\.

### 5\.1Benchmarks

#### Full\-slice QA suite\.

After diagnosing and tuning on controlled slices, we evaluate on 12 full slice QA benchmarks that mix neutral and conflict contexts\. We report exact match QA accuracy on NQ SYNTH and NQ SWAP with synthetic or swapped contextsWang et al\. \([2025a](https://arxiv.org/html/2604.16686#bib.bib23)\); Khandelwal et al\. \([2025](https://arxiv.org/html/2604.16686#bib.bib4)\), HotpotQA distractor and HotpotQA support which are multi hop QA settings with distractor versus supporting retrieval contextsYang et al\. \([2018](https://arxiv.org/html/2604.16686#bib.bib25)\), NQ val short which is the original NQ open validation short answer set, PopQA which is an open domain QA benchmarkMallen et al\. \([2023](https://arxiv.org/html/2604.16686#bib.bib12)\), and TabMWP which evaluates table as context math word problemsLu et al\. \([2023](https://arxiv.org/html/2604.16686#bib.bib10)\)\. For efficiency, we use 1,000 example subsets for NQ SWAP and HotpotQA\. Following the paired no context and with context prompt design used by prior context\-aware decoding work such as CAD AdaCAD and CoCoA, we evaluate all open weight decoders on identical prompt pairs for each dataset example and vary only the decode time rule\.

#### Beyond QA\.

To test whether NWCAD generalizes beyond short answer QA, we additionally evaluate on two more datasets: ToFuEval \(topic\-focused dialogue summarization\)Tang et al\. \([2024b](https://arxiv.org/html/2604.16686#bib.bib20)\)and ExpertQA \(expert\-curated long\-form answers\)Malaviya et al\. \([2024](https://arxiv.org/html/2604.16686#bib.bib11)\), reporting AlignScore on ToFuEval and ROUGE\-L/BERTScore\-P on ExpertQA\.

### 5\.2Results and Discussion

Figure[4](https://arxiv.org/html/2604.16686#S4.F4)summarizes the main results \(Appendix[D](https://arxiv.org/html/2604.16686#A4)reports the full per\-dataset numbers\)\. We observe that NWCAD improves robustness on the augmented slices while remaining competitive \(and often improving\) on the diverse QA benchmarks\. The largest improvements occur in distractor heavy settings such as Distractor hard and HotpotQA distractor, consistent with the neutrality gate preventing type matched but non decisive context from overriding correct answers\. Gains are not limited to conservative backoff: NWCAD also improves Helpful and context defined tasks \(NQ SYNTH and NQ SWAP\), indicating the adapter still exploits context when the context stream is confident\. Because exact match can be overly strict for short free\-form answers, we also evaluate the QA outputs from all three open\-weight backbones with GPT\-4o\-mini as an LLM judge \(Appendix[E](https://arxiv.org/html/2604.16686#A5)\)\. In the updated semantic\-evaluation table, the same overall conclusion still holds: NWCAD is the strongest overall method across the three backbones, with especially clear gains on Restate\-hard and Distractor\-hard\. This indicates that the improvements are not simply an artifact of strict EM scoring\.

Across all three models, NWCAD also achieves the best ToFuEval AlignScore and improves ExpertQA ROUGE\-L/BERTScore\-P over both With\-context and contrastive decoding baselines, suggesting the adapter generalizes beyond short\-answer QA\.

## 6Ablations

We report additional analyses on component ablations \(NWCADBC\{\}\_\{\\text\{BC\}\}vs\. full NWCAD\), NWCAD as an adapter over CAD/AdaCAD/CoCoA, routing frequency \(how often the fallback decoder is used\), and latency relative to the corresponding base decoder\.

![Refer to caption](https://arxiv.org/html/2604.16686v1/figures/component_ablation_improvement_abs.png)Figure 5:NWCADBC\{\}\_\{\\text\{BC\}\}vs\. NWCAD\.![Refer to caption](https://arxiv.org/html/2604.16686v1/figures/adapter_gains_combined.png)Figure 6:NWCAD as an adapter over existing decoders\. Each cell shows the gain from wrapping a base decoderXXwith NWCADXacross three backbones, on controlled\-slice weighted average and full\-slice average\.#### NWCADBC\{\}\_\{\\text\{BC\}\}vs\. NWCAD\.

NWCAD was developed sequentially: NWCADBC\{\}\_\{\\text\{BC\}\}uses Stage 1 \(BC gate\) to protect baseline\-correct neutral cases \(tuningτ\\tauto control conservatism\), and the full method adds Stage 2 \(CC gate \+ CAD\-style fallback\) to better utilize clearly helpful contexts\. Figure[5](https://arxiv.org/html/2604.16686#S6.F5)shows that Stage 2 improves performance on all testing cases on average by 5\.2% \(detailed results in Appendix[F](https://arxiv.org/html/2604.16686#A6)\)\. This shows that Stage 2 mainly helps when the model should use the context; the no\-fallback ablation below further suggests that most of the gain comes from choosing between no\-context and standard with\-context decoding, while the CAD\-style fallback adds a smaller extra benefit\.

#### NWCAD as an adapter over CAD/AdaCAD/CoCoA\.

NWCAD is a adapter that can be layered on top of different two stream decoders by swapping the fallback implementation\. For a base decoderXX, we instantiate NWCADXby usingXXas the fallback decoder in Stage 2\. Figure[6](https://arxiv.org/html/2604.16686#S6.F6)show that NWCAD consistently improves over each corresponding base decoder across models \(Appendix[G](https://arxiv.org/html/2604.16686#A7)reports full results\)\. On controlled slices, gains range from about 7 to over 40 accuracy points, with the largest improvements observed when wrapping CAD and CoCoA, which are more susceptible to neutral regression under distractor contexts\. Improvements are driven primarily by large boosts on Restated and Distractor subsets, while Helpful accuracy is preserved or modestly improved, indicating that NWCAD’s explicit backoff and routing primarily correct overreaction to weak or noisy context rather than suppressing genuine context utilization\.

#### No\-fallback ablation\.

Table[3](https://arxiv.org/html/2604.16686#S6.T3)reports a no\-fallback ablation on a 100\-example\-per\-slice subset\. To isolate the effect of the routing decision itself, we evaluate a simplified*No\-fallback*variant: if Stage 1 does not back off to the no\-context stream, decoding routes directly to standard with\-context logits, without using the CAD\-style fallback\. Here, NWCADBC\{\}\_\{\\text\{BC\}\}denotes the Stage 1\-only variant,*No\-fallback*denotes the no\-fallback variant, and NWCAD denotes the full two\-stage method\. The result stays very close to full NWCAD, which suggests that most of the gain comes from choosing the right branch, while the CAD\-style fallback adds a smaller extra benefit\.

Table 3:No\-fallback ablation \(in %\) on a 100\-example\-per\-slice subset \(Llama\-3\.1\-8B; accuracy; %\)\. Here, NWCADBC\{\}\_\{\\text\{BC\}\}denotes the Stage 1\-only variant, while*No\-fallback*routes directly to standard with\-context decoding whenever Stage 1 does not back off, without using the CAD\-style fallback\.
#### Routing statistics \(how often the fallback is used\)\.

The fallback decoder is intended as a safety net for a small set of uncertain steps, not the default path\. As shown in Table[4](https://arxiv.org/html/2604.16686#S6.T4), it is used for only about 1\-2% of generated tokens across tasks, which means NWCAD’s improvements do not come from frequent fallback use\. Instead, NWCAD mainly switches between the no\-context and with\-context streams on a per\-token basis, using the fallback only occasionally\. This suggests that the key to effective context\-aware decoding is decidingwhento trust the model’s prior knowledge versus the provided context, rather than constantly blending the two at every step\. On context\-driven benchmarks \(e\.g\., NQ\-SWAP\), routing toward the context stream dominates, while fallback use remains minimal\.

Table 4:Routing frequencies for NWCAD’s three\-way decision \(%\)\. \(*No\-Context*= BC/no\-context,*Context*= CC/with\-context,*Fallback*= CAD\-style fallback,*Any\-fallback*is the fraction of examples that use the fallback at least once\)\. The CAD\-style fallback is rarely invoked\.
#### Efficiency and Latency\.

Because NWCAD introduces additional control logic and, in principle, extra computation, it is important to verify that the reliability gains do not come at a inference cost\. We report Llama\-3\.1\-8B\-Instruct relative decoding latency \(sec/token\) of NWCADXversus its base decoderXXon a single RTX 5090 GPU \(FP16; microbatch=1; 30 examples per workload\) From the results presented in Table[5](https://arxiv.org/html/2604.16686#S6.T5), we observe that across ToFuEval and ExpertQA, NWCAD stays within about 2% ofXX; on short\-answer QA it can be faster because the adapter often routes to the prior/context streams and invokes contrastive mixing only rarely\.

Table 5:Relative decoding latency \(sec/token; ratio\) of NWCADXto its base decoderXX\. Values below 1 mean NWCAD is faster than the base model\. NWCAD is either similar or slightly faster\.

## 7Conclusion

We introduced No\-Worse Context\-Aware Decoding as a decode\-time mechanism for context\-conditioned generation\. NWCAD addresses neutral regression by preserving the no\-context decision while still shifting toward context under conflict\. Across all tests, NWCAD consistently improves over CAD, AdaCAD and CoCoA style decoders by reducing regressions under distractor contexts while preserving gains on genuinely helpful contexts\. More broadly, our results point to*regime selection*as a useful abstraction for future context\-aware decoding: for most steps, generation is best handled by either the no\-context stream or the context stream, with contrastive mixing needed only rarely\. This perspective motivates conditional\-compute adapters that dynamically decide when and how to rely on contextual evidence, rather than continuously mixing streams at every step\. Looking forward, key directions include extending the neutral regression prevention beyond greedy decoding and improve reliability for long form outputs\.

## Limitations

Our work is not without limitations\. First, our default thresholds are tuned on one model \(Llama\-3\.1\-8B\) and transferred to other models and benchmarks; while this works well in our experiments, these values may not be optimal for new models, domains, or prompting setups\. Second, NWCAD requires token\-level access to both no\-context and with\-context logits, so it does not directly apply to black\-box API models that do not expose logits\.

## Acknowledgments

We sincerely thank the anonymous reviewers and the members of the PortNLP group for their valuable feedback and helpful suggestions\.

## References

- AI@Meta \(2024\)AI@Meta\. 2024\.[Llama 3 model card](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md)\.
- Iratni et al\. \(2025\)Malika Iratni, Mohand Boughanem, and Taoufiq Dkaki\. 2025\.[Dynamic context selection for retrieval\-augmented generation: Mitigating distractors and positional bias](https://arxiv.org/abs/2512.14313)\.*Preprint*, arXiv:2512\.14313\.
- Izacard and Grave \(2021\)Gautier Izacard and Edouard Grave\. 2021\.[Leveraging passage retrieval with generative models for open domain question answering](https://arxiv.org/abs/2007.01282)\.*Preprint*, arXiv:2007\.01282\.
- Khandelwal et al\. \(2025\)Anant Khandelwal, Manish Gupta, and Puneet Agrawal\. 2025\.[Cocoa: Confidence and context\-aware adaptive decoding for resolving knowledge conflicts in large language models](https://arxiv.org/abs/2508.17670)\.*Preprint*, arXiv:2508\.17670\.
- Kryscinski et al\. \(2020\)Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher\. 2020\.[Evaluating the factual consistency of abstractive text summarization](https://arxiv.org/abs/1910.12840)\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing*\.
- Lahmy and Yozevitch \(2025\)Moshe Lahmy and Roi Yozevitch\. 2025\.[Replace, don’t expand: Mitigating context dilution in multi\-hop RAG via fixed\-budget evidence assembly](https://arxiv.org/abs/2512.10787)\.*Preprint*, arXiv:2512\.10787\.
- Lee et al\. \(2019\)Kenton Lee, Ming\-Wei Chang, and Kristina Toutanova\. 2019\.[Latent retrieval for weakly supervised open domain question answering](https://doi.org/10.18653/v1/P19-1612)\.In*Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*, pages 6086–6096, Florence, Italy\. Association for Computational Linguistics\.
- Lewis et al\. \(2020\)Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela\. 2020\.[Retrieval\-augmented generation for knowledge\-intensive NLP tasks](https://arxiv.org/abs/2005.11401)\.In*Advances in Neural Information Processing Systems*\.
- Li and Ouyang \(2024\)Xiangci Li and Jessica Ouyang\. 2024\.[How does knowledge selection help retrieval augmented generation?](https://arxiv.org/abs/2410.13258)*Preprint*, arXiv:2410\.13258\.
- Lu et al\. \(2023\)Pan Lu, Yichong Zhang, Xinyu Liu, and William Yang Wang\. 2023\.Dynamic prompt learning via policy gradient for semi\-structured mathematical reasoning\.In*Proceedings of the 2023 International Conference on Learning Representations*\.
- Malaviya et al\. \(2024\)Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth\. 2024\.[Expertqa: Expert\-curated questions and attributed answers](https://doi.org/10.18653/v1/2024.naacl-long.167)\.In*Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 3025–3045, Mexico City, Mexico\. Association for Computational Linguistics\.
- Mallen et al\. \(2023\)Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi\. 2023\.[When not to trust language models: Investigating effectiveness of parametric and non\-parametric memories](https://doi.org/10.18653/v1/2023.acl-long.546)\.In*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 9802–9822, Toronto, Canada\. Association for Computational Linguistics\.
- Maynez et al\. \(2020\)Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald\. 2020\.[On faithfulness and factuality in abstractive summarization](https://arxiv.org/abs/2005.00661)\.In*Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*\.
- Min et al\. \(2023\)Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen\-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi\. 2023\.[FActScore: Fine\-grained atomic evaluation of factual precision in long form text generation](https://aclanthology.org/2023.emnlp-main.741)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*\.
- Mistral AI \(2025\)Mistral AI\. 2025\.Introducing mistral 3\.[https://mistral\.ai/news/mistral\-3](https://mistral.ai/news/mistral-3)\.
- Nguyen et al\. \(2024\)Xuan\-Phi Nguyen, Shrey Pandit, Senthil Purushwalkam, Austin Xu, Hailin Chen, Yifei Ming, Zixuan Ke, Silvio Savarese, Caiming Xiong, and Shafiq Joty\. 2024\.[Sfr\-rag: Towards contextually faithful llms](https://arxiv.org/abs/2409.09916)\.*Preprint*, arXiv:2409\.09916\.
- Nie et al\. \(2024\)Fan Nie, Xiaotian Hou, Shuhang Lin, James Zou, Huaxiu Yao, and Linjun Zhang\. 2024\.[Facttest: Factuality testing in large language models with finite\-sample and distribution\-free guarantees](https://arxiv.org/abs/2411.02603)\.*Preprint*, arXiv:2411\.02603\.
- Shi et al\. \(2023\)Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen\-tau Yih\. 2023\.[Trusting your evidence: Hallucinate less with context\-aware decoding](https://arxiv.org/abs/2305.14739)\.*Preprint*, arXiv:2305\.14739\.
- Tang et al\. \(2024a\)Liyan Tang, Philippe Laban, and Greg Durrett\. 2024a\.[Minicheck: Efficient fact\-checking of llms on grounding documents](https://arxiv.org/abs/2404.10774)\.*Preprint*, arXiv:2404\.10774\.
- Tang et al\. \(2024b\)Liyan Tang, Igor Shalyminov, Amy Wong, Jon Burnsky, Jake Vincent, Yu’an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, and Kathleen McKeown\. 2024b\.[Tofueval: Evaluating hallucinations of LLMs on topic\-focused dialogue summarization](https://doi.org/10.18653/v1/2024.naacl-long.251)\.In*Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 4455–4480, Mexico City, Mexico\. Association for Computational Linguistics\.
- Tao et al\. \(2025\)Yufei Tao, Adam Hiatt, Rahul Seetharaman, and Ameeta Agrawal\. 2025\.[“lost\-in\-the\-later”: Framework for quantifying contextual grounding in large language models](https://doi.org/10.1109/ICDMW69685.2025.00204)\.In*2025 IEEE International Conference on Data Mining Workshops \(ICDMW\)*, pages 1703–1712\.
- Tomani et al\. \(2024\)Christian Tomani, Kamalika Chaudhuri, Ivan Evtimov, Daniel Cremers, and Mark Ibrahim\. 2024\.[Uncertainty\-based abstention in LLMs improves safety and reduces hallucinations](https://arxiv.org/abs/2404.10960)\.*Preprint*, arXiv:2404\.10960\.
- Wang et al\. \(2025a\)Han Wang, Archiki Prasad, Elias Stengel\-Eskin, and Mohit Bansal\. 2025a\.[Adacad: Adaptively decoding to balance conflicts between contextual and parametric knowledge](https://arxiv.org/abs/2409.07394)\.*Preprint*, arXiv:2409\.07394\.
- Wang et al\. \(2025b\)Qingni Wang, Yue Fan, and Xin Eric Wang\. 2025b\.[Safer: Risk\-constrained sample\-then\-filter in large language models](https://arxiv.org/abs/2510.10193)\.*Preprint*, arXiv:2510\.10193\.
- Yang et al\. \(2018\)Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D\. Manning\. 2018\.[Hotpotqa: A dataset for diverse, explainable multi\-hop question answering](https://doi.org/10.18653/v1/D18-1259)\.In*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 2369–2380\.
- Zhang et al\. \(2024\)Huajian Zhang, Yumo Xu, and Laura Perez\-Beltrachini\. 2024\.[Fine\-grained natural language inference based faithfulness evaluation for diverse summarisation tasks](https://doi.org/10.18653/v1/2024.eacl-long.102)\.In*Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 1701–1722, St\. Julian’s, Malta\. Association for Computational Linguistics\.

## Appendix ATop\-KKJS Approximation Validation

We approximate JS divergence over the union of the top\-KKtokens from both streams \(K=50K\{=\}50in the main paper\)\. To assess whether this approximation affects routing decisions, we compare it against full\-vocab JS on Llama\-3\.1\-8B\-Instruct using 50 examples per controlled slice \(150 total; 784 generated tokens; greedy decoding with max\_new\_tokens=32,τ=0\.3\\tau\{=\}0\.3,κpri=0\.30\\kappa\_\{\\text\{pri\}\}\{=\}0\.30\)\. We also checkedK=200K\{=\}200and observed identicalNeutral​\(t\)\\text\{Neutral\}\(t\)and Stage 1 \(BC gate\) backoff decisions with even smaller JS error \(Table[A\.1](https://arxiv.org/html/2604.16686#A1.T1)\)\.

Table A\.1:Top\-KKunion JS \(K=50K\{=\}50\) vs\. full\-vocab JS on controlled\-slice decoding steps\. The approximation yields zero token\-step disagreements for neutrality/backoff decisions and closely matches full\-vocab JS \(MAE≤\\leq0\.0012\)\.
## Appendix BNWCAD Parameter Tuning

#### Tuning objective and trade\-offs\.

NWCAD introduces a small number of interpretable scalars that expose a do\-no\-harm vs\. context utilization trade\-off at decode time\. Increasing the neutrality thresholdτ\\tauexpands the region where we defer to the no\-context stream \(improving no\-worse behavior on neutral inputs\) but can over\-gate context utilization on helpful\-context inputs\. Similarly, higher confidence thresholdsκpri\\kappa\_\{\\text\{pri\}\}andκctx\\kappa\_\{\\text\{ctx\}\}make the two\-stage gate more conservative in when it trusts a stream\.

#### Parameters\.

We tune three scalars: the neutrality thresholdτ\\tau, and the top\-1 margin thresholdsκpri\\kappa\_\{\\text\{pri\}\}andκctx\\kappa\_\{\\text\{ctx\}\}used by the BC and CC gates, respectively\. Unless otherwise noted, we use the same decoding setup as the main QA experiments \(greedy, max\_new\_tokens=32\)\. We select these default thresholds on the Llama\-3\.1\-8B controlled slices \(no separate held\-out dev split\) and then reuse them for all other models and full\-slice benchmarks\. Table[B\.1](https://arxiv.org/html/2604.16686#A2.T1)summarizes this directly\. We do*not*tune thresholds separately for each model; instead, we tune them on Llama\-3\.1\-8B and reuse the same settings on Llama\-3\.1\-70B and Ministral\-3\-8B\.

Table B\.1:Reuse of the same NWCAD thresholds across model families\. The defaults are tuned once on Llama\-3\.1\-8B and then reused unchanged on Llama\-3\.1\-70B and Ministral\-3\-8B\. QA macro reports the mean across the ten full\-slice QA benchmarks in Table[D\.1](https://arxiv.org/html/2604.16686#A4.T1)\.
#### τ\\tausweep \(\-no\-harm vs\. context utilization frontier\)\.

To make the do\-no\-harm vs\. context utilization trade\-off explicit, we sweepτ\\tauon the same model and decoding setup \(Table[B\.2](https://arxiv.org/html/2604.16686#A2.T2)\)\. Restate/Distractor report accuracy on baseline\-correct neutral subsets; since the baseline is 100% by construction on BC, we also report the corresponding accuracy drop \(100−accuracy100\-\\text\{accuracy\}\) in parentheses\. Helpful reports accuracy on the helpful\-context subset; and Overall is the micro\-averaged score across these subsets\.

We useτ=0\.3\\tau\{=\}0\.3in the main results, since it achieves the best overall accuracy on this combined benchmark while maintaining strong neutral preservation; largerτ\\tauover\-gates helpful contexts\.

#### Sensitivity ofκpri\\kappa\_\{\\text\{pri\}\}\(BC gate margin threshold\)\.

κpri\\kappa\_\{\\text\{pri\}\}controls how confident the no\-context stream must be for the BC gate to apply on low\-divergence steps\. Withτ=0\.3\\tau\{=\}0\.3andκctx=0\.05\\kappa\_\{\\text\{ctx\}\}=0\.05fixed, we sweepκpri∈\{0\.00,0\.05,0\.10,0\.15,0\.20,0\.30\}\\kappa\_\{\\text\{pri\}\}\\in\\\{0\.00,0\.05,0\.10,0\.15,0\.20,0\.30\\\}and observe smooth improvements with diminishing returns asκpri\\kappa\_\{\\text\{pri\}\}increases \(Table[B\.3](https://arxiv.org/html/2604.16686#A2.T3)\)\. We useκpri=0\.30\\kappa\_\{\\text\{pri\}\}\{=\}0\.30in all experiments\.

#### Tuningκctx\\kappa\_\{\\text\{ctx\}\}\(CC gate\)\.

Withτ=0\.3\\tau\{=\}0\.3andκpri=0\.30\\kappa\_\{\\text\{pri\}\}\{=\}0\.30fixed, we sweep the context\-stream margin thresholdκctx\\kappa\_\{\\text\{ctx\}\}and report controlled\-slice accuracy on the baseline\-correct \(BC\) subsets \(Table[B\.4](https://arxiv.org/html/2604.16686#A2.T4)\)\. We chooseκctx=0\.05\\kappa\_\{\\text\{ctx\}\}\{=\}0\.05among the tested values:

Table B\.2:Sensitivity of the neutrality thresholdτ\\tauon neutral preservation and helpful utilization\.τ\\tauexposes a do\-no\-harm vs\. context utilization trade\-off\.Table B\.3:Sensitivity sweep for the BC\-gate margin thresholdκpri\\kappa\_\{\\text\{pri\}\}\(accuracy; %\)\. Performance improves smoothly and largely converges byκpri≈0\.2\\kappa\_\{\\text\{pri\}\}\{\\approx\}0\.2\-0\.3\.Table B\.4:Sensitivity sweep for the CC\-gate margin thresholdκctx\\kappa\_\{\\text\{ctx\}\}\(accuracy; %\)\. We selectκctx=0\.05\\kappa\_\{\\text\{ctx\}\}\{=\}0\.05\.

## Appendix CExact Prompting and Decoding Setup

#### QA prompting\.

For instruction\-tuned models, we use a chat\-style prompt with a*system prompt*and a*user prompt*\. The system prompt is:Answer the question concisely with just the fact or name\. Do not add explanations or extra sentences\.The user prompt ends withAnswer:to encourage short\-answer formatting\.

#### No\-context vs\. with\-context\.

Each QA example uses the same question under two conditions: a question\-only \(no\-context\) user prompt and a with\-context user prompt that prepends the retrieved/synthetic context followed by an explicit marker \(Using only the references listed above, answer the following question:\)\. This paired\-prompt setup is shared across CAD/AdaCAD/CoCoA/NWCAD: Baseline and With\-context correspond to using the no\-context or with\-context prompt, respectively, while two\-stream methods compute both distributions under the same shared generated prefix\.

#### Decoding parameters\.

Unless otherwise noted, we use greedy decoding withmax\_new\_tokens=32,max\_context\_length=4064, and a fixed seed \(seed=42\)\. Two\-stream methods compute both the with\-context and no\-context distributions at each step under the shared generated prefix\. NWCAD uses a top\-KKunion approximation for JS divergence \(K=50K\{=\}50\)\. We useτ=0\.3\\tau\{=\}0\.3,κpri=0\.30\\kappa\_\{\\text\{pri\}\}\{=\}0\.30, andκctx=0\.05\\kappa\_\{\\text\{ctx\}\}\{=\}0\.05; on the context side, if neither stream is confident, we invoke a CAD\-style fallback decoder \(CoCoA in our main experiments; the fallback can be swapped, e\.g\., AdaCAD, as an ablation\)\. For Llama\-3\.1\-70B, we additionally use quantized loading \(int8\) and microbatching for memory\.

#### Answer extraction and scoring\.

In most cases the generations are already short answers due to the answer\-only prompting\. When needed \(primarily for CoCoA\-style decoding under EOS\-suppression variants that can produce long continuations\), we apply a lightweight extraction that takes the text following the finalAnswer:marker \(if present\) or the first non\-empty line\. We report exact\-match accuracy \(SQuAD\-normalized\), which lowercases and strips punctuation, articles, and extra whitespace before matching\.

## Appendix DFull\-slice QA and Beyond\-QA: Full Tables

Tables[D\.1](https://arxiv.org/html/2604.16686#A4.T1)and[D\.2](https://arxiv.org/html/2604.16686#A4.T2)provide the numeric results underlying Figure[4](https://arxiv.org/html/2604.16686#S4.F4)\.

Table D\.1:Full\-slice QA results \(%\) on augmented NQ\-open and a diverse QA suite\. NWCAD improves robustness on the augmented slices while remaining competitive on general QA tasks; NQ\-SYNTH/NQ\-SWAP are context\-defined, so near\-zero Baseline \(no\-context\) accuracy is expected \(With\-context sanity check\); CoCoA uses EOS\-allowed decoding \(Appendix[H](https://arxiv.org/html/2604.16686#A8)\)\.Table D\.2:Beyond\-QA results on ToFuEval and ExpertQA\. NWCAD remains competitive and improves the overall score across models\.
## Appendix ELLM\-as\-a\-Judge Evaluation

To check whether strict exact\-match undercounts semantically correct short answers, we evaluate the QA outputs from all three open\-weight backbones with GPT\-4o\-mini as an LLM judge\. Table[E\.1](https://arxiv.org/html/2604.16686#A5.T1)mirrors Table[D\.1](https://arxiv.org/html/2604.16686#A4.T1): semantic evaluation raises absolute scores, but the same overall conclusion remains, with NWCAD strongest overall and especially strong on Restate\-hard and Distractor\-hard\.

Table E\.1:GPT\-4o\-mini semantic judging on the QA outputs across all three open\-weight backbones \(accuracy; %\), in the same layout as Table[D\.1](https://arxiv.org/html/2604.16686#A4.T1)\. For Llama\-3\.1\-70B we evaluate the same subset sizes used in its corresponding QA table \(500/800 examples per slice\); Llama\-3\.1\-8B and Ministral\-3\-8B use the full available QA slices\. Relative to exact match, semantic evaluation increases absolute scores while keeping the same overall picture, with NWCAD strongest overall and especially strong on distractor\-heavy / no\-harm\-oriented slices\.
## Appendix FNWCADBC\{\}\_\{\\text\{BC\}\}vs\. NWCAD\.

Table[F\.1](https://arxiv.org/html/2604.16686#A6.T1)reports the controlled\-slice comparison\. Together with the no\-fallback ablation below, it suggests that Stage 2 mainly helps when the model should use the context, while most of the gain comes from selecting the right branch\.

Table F\.1:Component ablation comparing NWCADBC\{\}\_\{\\text\{BC\}\}\(Stage 1 only\) vs\. full two\-stage NWCAD on controlled slices \(accuracy; %\)\. Stage 2 improves helpful accuracy but can trade off distractor\-hard BC accuracy\.Table[F\.2](https://arxiv.org/html/2604.16686#A6.T2)reports the corresponding full\-slice QA ablation on Llama\-3\.1\-8B using the same setup as the main QA results\. As in the controlled\-slice comparison above, NWCADBC\{\}\_\{\\text\{BC\}\}already captures most of the gain, while the full two\-stage method adds smaller additional improvements on top\.

Table F\.2:Full\-slice QA ablation comparing NWCADBC\{\}\_\{\\text\{BC\}\}\(Stage 1 only\) vs\. full NWCAD on Llama\-3\.1\-8B \(accuracy; %\)\. These values match the rebuttal table\.
## Appendix GNWCAD as an Adapter: Full Results

Tables[G\.1](https://arxiv.org/html/2604.16686#A7.T1)and[G\.2](https://arxiv.org/html/2604.16686#A7.T2)provide the numeric results underlying Figure[6](https://arxiv.org/html/2604.16686#S6.F6)\.

Table G\.1:Controlled\-slice weighted\-average accuracy \(%\) for each base decoderXXand its NWCADXwrapper \(fallback decoder =XX\)\. NWCAD improves over each corresponding base decoder across evaluated models\.Table G\.2:Full\-slice gains \(%\) from wrapping each base decoderXXwith NWCADX, computed on the unfiltered augmented NQ\-open slices \(Restate\-hard, Distractor\-hard, Helpful\)\. Gains persist on full\-slice mixtures\.
## Appendix HCoCoA EOS Suppression Ablation

The official CoCoA release suppresses the EOS token by setting\-\-dsab\-min\-len=5\(released default\), which can force long continuations up tomax\_new\_tokensand interact with answer extraction in short\-answer QA\. For short\-answer QA, we set\-\-dsab\-min\-len=0to allow EOS stopping\. Table[H\.1](https://arxiv.org/html/2604.16686#A8.T1)reports the resulting performance differences\. We report CoCoA with EOS allowed in the main paper and include the EOS\-suppressed setting here for reference\.

Table H\.1:Effect of CoCoA EOS suppression on the three augmented NQ\-open slices \(%\)\. We report CoCoA with EOS allowed in the main paper; the EOS\-suppressed setting can be sensitive to model formatting and short\-answer extraction\.
## Appendix IFull\-slice Sensitivity ofκctx\\kappa\_\{\\text\{ctx\}\}

κctx\\kappa\_\{\\text\{ctx\}\}controls Stage 2 routing on the context side: when the context stream is sufficiently confident \(top\-1 marginp1−p2≥κctxp\_\{1\}\{\-\}p\_\{2\}\\geq\\kappa\_\{\\text\{ctx\}\}\), NWCAD uses vanilla context logits; otherwise it falls back to a contrastive decoder \(CoCoA in our main setup\)\. We tuneκctx\\kappa\_\{\\text\{ctx\}\}on the controlled baseline\-correct \(BC\) subsets \(Table[B\.4](https://arxiv.org/html/2604.16686#A2.T4)\) to maximize combined BC accuracy\. Table[I\.1](https://arxiv.org/html/2604.16686#A9.T1)repeats the same sweep on full\-slice mixtures; results are stable across the tested range, indicating the choice ofκctx=0\.05\\kappa\_\{\\text\{ctx\}\}\{=\}0\.05is not brittle\.

Table I\.1:Sensitivity ofκctx\\kappa\_\{\\text\{ctx\}\}on full\-slice QA and general datasets \(accuracy; %\)\.

Similar Articles

Intermittent random token injection during decoding stage increases LLM diversity without fine-tuning

Reddit r/ArtificialInteligence

A Harvard research paper introduces Recoding-Decoding (RD), a novel decoding scheme that injects random priming phrases and diverting tokens to tap into an LLM's long-tail knowledge, significantly boosting output diversity without fine-tuning. The method maintains high relevance while mitigating response homogenization, with stronger models showing greater diversity gains.

Targeted Neuron Modulation via Contrastive Pair Search

Hugging Face Daily Papers

Contrastive neuron attribution (CNA) identifies a sparse set of MLP neurons that distinguish harmful from benign prompts, enabling effective behavioral steering in instruction-tuned LLMs without degrading output quality. The method reduces refusal rates by over 50% on jailbreak benchmarks while preserving fluency.