Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

Hugging Face Daily Papers

Summary

This paper introduces Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable intermediate layer in LLMs using entropy-guided search, mitigating the alignment tax and improving reasoning performance on benchmarks like GPQA-Diamond and Omni-MATH with negligible overhead.

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.

Cached at: 06/23/26, 09:41 AM

Source: https://huggingface.co/papers/2606.21906

Abstract

Autoregressive generation in large language models traditionally uses the final layer for token prediction, but a new decoding strategy dynamically selects more reliable intermediate layers based on entropy-guided search, improving reasoning performance with minimal computational overhead.

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.
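
To make the idea of an entropy-guided conservative backward search concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact rule, so the function names (`softmax_entropy`, `select_layer`), the `max_back` window, the `margin` threshold, and the stop-at-first-failure rule are all assumptions made for illustration. The sketch also assumes that each layer's hidden state has already been projected to vocabulary logits.

```python
import math
from typing import Sequence


def softmax_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over `logits`."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    entropy = 0.0
    for e in exps:
        p = e / total
        if p > 0.0:  # skip terms that underflowed to zero
            entropy -= p * math.log(p)
    return entropy


def select_layer(
    layer_logits: Sequence[Sequence[float]],
    max_back: int = 3,    # hypothetical: how many layers below the final one to inspect
    margin: float = 0.1,  # hypothetical: required entropy drop before moving earlier
) -> int:
    """Pick a near-final layer by walking backward from the final layer.

    Assumed rule (not taken from the paper): step to an earlier layer only
    if its entropy is lower than the current best by at least `margin`,
    and stop at the first layer that fails this test.
    """
    final = len(layer_logits) - 1
    best = final
    best_entropy = softmax_entropy(layer_logits[final])
    lowest = max(final - max_back, 0)
    for layer in range(final - 1, lowest - 1, -1):
        entropy = softmax_entropy(layer_logits[layer])
        if entropy < best_entropy - margin:
            best, best_entropy = layer, entropy
        else:
            break  # conservative: do not search past a less confident layer
    return best


# Toy per-layer logits over a 3-token vocabulary, shaped to mimic
# guess (layer 0), refine (layers 1-2), and perturb (final layer 3).
toy_logits = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [6.0, 0.0, 0.0],
    [1.5, 1.0, 0.0],
]
chosen_layer = select_layer(toy_logits)
chosen_row = toy_logits[chosen_layer]
next_token = chosen_row.index(max(chosen_row))  # greedy pick from the chosen layer
```

On this toy input the search moves from the final layer to layer 2, whose distribution is much sharper, and then stops because layer 1 is less confident. When the final layer is already the most confident one in the window, the function returns the final layer, so decoding falls back to the conventional behavior.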

Similar Articles

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

arXiv cs.AI

This paper introduces CASPO, a framework for aligning token-level confidence with step-wise logical correctness in large reasoning models using iterative Direct Preference Optimization. It also proposes Confidence-aware Thought (CaT) for dynamically pruning uncertain reasoning branches during inference to improve reliability and efficiency.