The paper addresses catastrophic forgetting in sequentially trained early-exiting neural networks and proposes two methods, based on Elastic Weight Consolidation and Learning without Forgetting, that preserve the performance of earlier exits while new exits are added.
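A minimal PyTorch sketch of how an EWC-style penalty could protect weights important to earlier exits while a new exit head is trained; the `old_params`/`fisher` dictionaries, the `lam` coefficient, and the loss combination are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def ewc_penalty(model: nn.Module,
                old_params: dict,   # parameter snapshot taken before adding the new exit
                fisher: dict,       # diagonal Fisher estimates for those parameters
                lam: float = 100.0) -> torch.Tensor:
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Training a newly added exit would then optimize, per batch:
#   loss = task_loss_for_new_exit + ewc_penalty(model, old_params, fisher)
# which discourages drift in the weights that earlier exits rely on.
```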
The authors propose a 2D early-exit method that jointly trims layers and input sentences, yielding a 1.4–2.3× additional speed-up on sentiment tasks across Llama 3.1/3.2, Gemma, and Qwen models.
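One plausible reading of the 2D scheme, sketched below in PyTorch: along the sequence axis the input sentence is shortened by keeping only the highest-scoring tokens at each block, and along the depth axis the forward pass stops once the pooled prediction is confident enough. The norm-based token score, the thresholds, and the `blocks`/`head` modules are assumptions for illustration, not the paper's actual criteria.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def two_dim_early_exit(blocks: nn.ModuleList,     # hypothetical transformer blocks
                       head: nn.Module,           # hypothetical classification head
                       x: torch.Tensor,           # (seq_len, hidden) token states for one sentence
                       keep_ratio: float = 0.7,   # fraction of tokens kept per block
                       exit_threshold: float = 0.9) -> torch.Tensor:
    probs = None
    for block in blocks:
        x = block(x)
        # Sequence-axis trimming: drop the lowest-scoring tokens
        # (L2 norm used here as a stand-in importance score).
        k = max(1, int(keep_ratio * x.size(0)))
        x = x[x.norm(dim=-1).topk(k).indices]
        # Depth-axis trimming: stop as soon as the pooled prediction is confident.
        probs = head(x.mean(dim=0)).softmax(dim=-1)
        if probs.max() >= exit_threshold:
            break
    return probs.argmax()
```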
River-LLM is a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving a 1.71–2.16× speedup without quality loss.
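A rough sketch of the KV-sharing idea: when a token exits at some layer during decoding, the key/value pair it produced at that layer is copied into the caches of all deeper layers, so subsequent tokens that run the full depth never hit a missing cache entry. The cache layout (one (keys, values) pair per layer, sequence-first) and the function name are illustrative assumptions rather than River-LLM's actual interface.

```python
import torch

def fill_kv_gaps(kv_cache: list,      # kv_cache[l] = (keys, values), each (seq, n_heads, head_dim)
                 exit_layer: int) -> None:
    """Copy the exiting token's KV from its exit layer into every deeper layer's cache."""
    k_exit, v_exit = kv_cache[exit_layer]
    last_k, last_v = k_exit[-1:], v_exit[-1:]   # KV of the token that just exited early
    for layer in range(exit_layer + 1, len(kv_cache)):
        k, v = kv_cache[layer]
        # Skipped layers reuse the exit layer's KV instead of leaving a cache gap.
        kv_cache[layer] = (torch.cat([k, last_k], dim=0),
                           torch.cat([v, last_v], dim=0))
```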