@sheriyuo: This paper proposes ASAG, Attention-State Adaptive Generation, a training-free, plug-and-play stopping framework for re…
Summary
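ASAG uses attention entropy to detect when further reasoning is unproductive, then stops early or redirects generation, which improves accuracy while reducing generated tokens. On Qwen3-8B, the post reports a 4.4% relative accuracy gain with over 40% fewer generated tokens; the paper's abstract states a 3.2% average accuracy improvement with nearly 40% fewer tokens.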
This paper proposes ASAG, Attention-State Adaptive Generation, a training-free, plug-and-play stopping framework for reasoning models.
Instead of relying only on output confidence, ASAG uses attention entropy to detect when further thinking is no longer useful, then stops early or redirects unproductive reasoning.
The authors report a 4.4% relative accuracy gain while cutting generated tokens by over 40% on Qwen3-8B across reasoning tasks.
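To make the idea of an attention-entropy stopping signal concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the normalization by log(n), the threshold, the patience window, and the assumption that high (diffuse) entropy marks unproductive reasoning are all hypothetical choices made for illustration. The paper may define and use the signal differently.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_entropy(weights: Sequence[float]) -> float:
    """Entropy scaled to [0, 1] by dividing by log(n), the maximum for n keys."""
    n = len(weights)
    if n <= 1:
        return 0.0
    return attention_entropy(weights) / math.log(n)


def should_stop(entropy_history: Sequence[float],
                threshold: float = 0.9,
                patience: int = 3) -> bool:
    """Hypothetical rule: stop once the last `patience` steps all have
    normalized entropy at or above `threshold`.

    ASSUMPTION: treating high entropy as "unproductive" is an illustrative
    choice, not something confirmed by the source.
    """
    if len(entropy_history) < patience:
        return False
    return all(h >= threshold for h in entropy_history[-patience:])


if __name__ == "__main__":
    # Toy per-step attention rows (each row: weights over 4 context tokens).
    steps = [
        [0.90, 0.05, 0.03, 0.02],  # focused attention
        [0.25, 0.25, 0.25, 0.25],  # fully diffuse attention
        [0.26, 0.24, 0.25, 0.25],
        [0.24, 0.26, 0.25, 0.25],
    ]
    history = [normalized_entropy(row) for row in steps]
    print("stop now?", should_stop(history))
```

In a real decoding loop, the attention rows would come from the model's attention weights for the most recent token, and the decision to stop or redirect would be made between generation steps. Which layers and heads to read, and how to aggregate them, is left open here because the source does not specify it.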
Stop When Further Reasoning Won’t Help: Attention-State Adaptive Generation in Reasoning Models
Paper: http://arxiv.org/abs/2606.15070
Source: https://arxiv.org/abs/2606.15070
Abstract: By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant token outputs and degraded accuracy. Current methods to mitigate this issue remain limited: training-based approaches require substantial computational resources, while training-free methods rely on well-crafted prompts or unreliable confidence signals. In this work, we investigate early stopping from the perspective of attention distributions and propose a simple method, ASAG, which infers the model’s reasoning state and adaptively adjusts the generation strategy. The proposed framework is training-free and plug-and-play, enabling seamless integration into existing LRMs. Extensive experiments on nine benchmarks demonstrate consistent improvements across mainstream LRMs with varying parameter scales, including the DeepSeek-R1-Distill and Qwen3 series. Specifically, ASAG improves average accuracy by 3.2% while reducing the number of generated tokens by nearly 40% across all reasoning tasks on Qwen3-8B.
Submission history
From: Jiakai Li
[v1] Sat, 13 Jun 2026 02:58:29 UTC (1,220 KB)
Similar Articles
Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models
This paper proposes ASAG, a training-free method that adaptively stops reasoning in large reasoning models based on attention distributions, reducing token usage by ~40% while improving accuracy by 3.2% on benchmarks using DeepSeek-R1-Distill and Qwen3 models.
AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation
AdaGATE is a training-free evidence controller for multi-hop RAG that uses entity-centric gap tracking, micro-query generation, and utility-based selection to improve robustness under noisy retrieval, achieving state-of-the-art evidence F1 with fewer input tokens.
ART: Attention Run-time Termination for Efficient Large Language Model Decoding
This paper proposes ART, a lightweight run-time mechanism that tracks accumulated attention outputs during LLM decoding and terminates unnecessary KV block accesses, achieving 20% higher generation throughput with comparable accuracy.
Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation
This paper introduces Agentic ASR, an interactive speech recognition framework that uses semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement. It also proposes a new sentence-level semantic error rate metric and an interactive simulation system for benchmarking.
Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models
This paper introduces ADAS, a training-free reranking rule for parallel masked diffusion decoding that uses attention to discount tokens that strongly attend to uncertain positions, improving low-NFE performance on reasoning and code tasks with minimal runtime overhead.