Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

Hugging Face Daily Papers 06/20/26, 12:00 AM Papers

Summary

This paper introduces Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable intermediate layer in LLMs using entropy-guided search, mitigating the alignment tax and improving reasoning performance on benchmarks like GPQA-Diamond and Omni-MATH with negligible overhead.

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.
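The abstract describes the search only at a high level, so the following is a minimal sketch of what an entropy-guided conservative backward search could look like, not the authors' implementation. It assumes that per-layer logits are already available (for example, by projecting each layer's hidden state through the model's output head), and the names `select_layer`, `decode_token`, `max_back`, and `margin` are hypothetical. The rule used here, which steps back one layer at a time only while entropy drops by at least `margin`, is an illustrative choice.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one layer's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; lower means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_layer(layer_logits: Sequence[Sequence[float]],
                 max_back: int = 3,
                 margin: float = 0.1) -> int:
    """Pick a near-final layer by walking backward from the last one.

    Hypothetical rule (an assumption, not taken from the paper): move to the
    previous layer only if its entropy is lower than the current choice by at
    least `margin`, look at no more than `max_back` layers, and stop at the
    first layer that fails the test.
    """
    entropies = [entropy(softmax(l)) for l in layer_logits]
    chosen = len(layer_logits) - 1           # default: the final layer
    lowest = max(0, chosen - max_back)       # conservative search window
    for layer in range(chosen - 1, lowest - 1, -1):
        if entropies[layer] <= entropies[chosen] - margin:
            chosen = layer                   # clearly more confident: step back
        else:
            break                            # no clear gain: stop searching
    return chosen


def decode_token(layer_logits: Sequence[Sequence[float]],
                 max_back: int = 3,
                 margin: float = 0.1) -> Tuple[int, int]:
    """Return (selected layer index, greedy token id from that layer)."""
    layer = select_layer(layer_logits, max_back, margin)
    logits = layer_logits[layer]
    token = max(range(len(logits)), key=lambda i: logits[i])
    return layer, token


if __name__ == "__main__":
    # Toy 4-token vocabulary across 4 layers: a coarse guess, a refinement,
    # a sharp prediction, then a flatter final-layer distribution.
    toy = [
        [0.1, 0.0, 0.2, 0.1],
        [1.0, 0.5, 0.2, 0.1],
        [6.0, 0.5, 0.2, 0.1],
        [2.0, 1.8, 0.2, 0.1],
    ]
    print(decode_token(toy, max_back=2))
```

Limiting the search to a small window below the final layer keeps the default behavior close to standard decoding: when no earlier layer is clearly more confident, the final layer is used unchanged.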

Cached at: 06/23/26, 09:41 AM

Source: https://huggingface.co/papers/2606.21906

Abstract

Autoregressive generation in large language models traditionally uses the final layer for token prediction, but a new decoding strategy dynamically selects more reliable intermediate layers based on entropy-guided search, improving reasoning performance with minimal computational overhead.
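To make "entropy-guided" concrete, one common way to score an intermediate layer is the Shannon entropy of the next-token distribution obtained by projecting that layer's hidden state through the output head. The notation below is an assumption for illustration ($h_\ell^{(t)}$ for the layer-$\ell$ hidden state at position $t$, $W_U$ for the output projection, $\mathcal{V}$ for the vocabulary); the page does not state the paper's exact definition.

```latex
p_\ell(v \mid x_{<t}) = \mathrm{softmax}\!\big(W_U\, h_\ell^{(t)}\big)_v ,
\qquad
H_\ell = -\sum_{v \in \mathcal{V}} p_\ell(v \mid x_{<t}) \,\log p_\ell(v \mid x_{<t})
```

Under this reading, a lower $H_\ell$ means layer $\ell$ concentrates its probability mass on fewer tokens, which is the sense in which a layer would count as more confident.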

Similar Articles

Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding

Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models
