This paper introduces PUMA, a plug-and-play framework that detects semantic redundancy in chain-of-thought reasoning to enable early exit, achieving 26.2% average token reduction across multiple models and benchmarks while preserving accuracy and reasoning quality.
This paper proposes Multi-Stage In-Flight Rejection (MSIFR), a training-free framework that reduces token waste in LLM-based synthetic data generation by detecting and terminating low-quality generation trajectories at intermediate checkpoints. Across five models and seven benchmarks, MSIFR reduces token consumption by 11–77% as a standalone method and up to 78.2% when combined with early-exit methods, while preserving or improving accuracy.
The paper addresses catastrophic forgetting in sequentially trained early-exiting neural networks and proposes two methods, based on Elastic Weight Consolidation and Learning without Forgetting, that preserve the performance of earlier exits while new exits are added.
The authors propose a 2D early-exit method that jointly trims layers and input sentences, yielding an additional 1.4–2.3× speed-up on sentiment tasks across Llama 3.1/3.2, Gemma, and Qwen models.
River-LLM is a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving a 1.71–2.16× speedup without quality loss.
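For readers unfamiliar with Elastic Weight Consolidation, which the catastrophic-forgetting summary above builds on, the following is a minimal sketch of the standard EWC quadratic penalty. It is not the paper's own formulation, which the summary does not give. The function names `ewc_penalty` and `total_loss`, the flat parameter lists, and the weighting `lam` are assumptions made for illustration only.

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.

    params     -- current parameter values (hypothetical flat list)
    old_params -- values saved after the earlier exits were trained
    fisher     -- per-parameter Fisher information estimates (importance)
    lam        -- penalty strength (assumed hyperparameter)
    """
    if not (len(params) == len(old_params) == len(fisher)):
        raise ValueError("params, old_params and fisher must have equal length")
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher)
    )


def total_loss(new_exit_loss, params, old_params, fisher, lam=1.0):
    """Loss for the newly added exit plus the EWC penalty on shared weights."""
    return new_exit_loss + ewc_penalty(params, old_params, fisher, lam)
```

In this sketch, parameters with a large Fisher value are pulled strongly toward their earlier values, while unimportant ones stay free to adapt to the new exit.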