Tag
The paper introduces SPEED, a layer-asymmetric KV-visibility policy that reduces long-context inference costs by running prompt tokens through only the lower layers during prefill while retaining full-depth attention during decoding.
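To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch of such a layer-asymmetric prefill/decode loop. It illustrates the general idea only, not SPEED's actual implementation: `ToyLayer`, `N_LOWER`, `prefill`, and `decode_step` are invented names, and layer norms, multi-head attention, and the LM head are omitted.

```python
import torch
import torch.nn as nn

D, N_LAYERS, N_LOWER = 64, 8, 4  # N_LOWER: layers that see prompt tokens

class ToyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(D, 3 * D)
        self.out = nn.Linear(D, D)

    def forward(self, x, cache):
        # x: (seq, D); cache holds the running K/V tensors for this layer
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        cache["k"] = torch.cat([cache["k"], k]) if cache["k"] is not None else k
        cache["v"] = torch.cat([cache["v"], v]) if cache["v"] is not None else v
        scores = q @ cache["k"].T / D**0.5
        # causal mask: each new token attends to cached positions up to its own
        t_q, t_k = q.shape[0], cache["k"].shape[0]
        mask = torch.arange(t_k) <= (t_k - t_q) + torch.arange(t_q)[:, None]
        att = scores.masked_fill(~mask, float("-inf")).softmax(-1)
        return x + self.out(att @ cache["v"])

layers = [ToyLayer() for _ in range(N_LAYERS)]
caches = [{"k": None, "v": None} for _ in range(N_LAYERS)]

def prefill(prompt_emb):
    # Prompt tokens traverse only the lower layers, so the upper-layer
    # caches stay empty -- this is the asymmetric visibility policy.
    h = prompt_emb
    for layer, cache in zip(layers[:N_LOWER], caches[:N_LOWER]):
        h = layer(h, cache)

def decode_step(tok_emb):
    # Decoded tokens run full depth; in the upper layers they can attend
    # only to previously decoded tokens, since no prompt KV was written there.
    h = tok_emb
    for layer, cache in zip(layers, caches):
        h = layer(h, cache)
    return h

prefill(torch.randn(16, D))           # 16 prompt tokens, lower layers only
out = decode_step(torch.randn(1, D))  # one decode step, all layers
```

Under this policy the prefill cost scales with `N_LOWER` rather than `N_LAYERS`, while upper-layer caches grow only with the (typically much shorter) generated sequence.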
River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to fill the KV-cache gaps left in skipped upper layers by early-exited tokens, achieving a 1.71–2.16× speedup without quality loss.
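The core mechanical question for any early-exit decoder is what to write into the KV caches of the layers a token skips, since later tokens that run full depth will attend at that position. The sketch below shows one common KV-sharing scheme, projecting the exit-layer hidden state through each skipped layer's own KV weights; this is an assumption about the general technique rather than River-LLM's confirmed design, and `ToyLayer`, `confidence`, and the exit threshold are hypothetical.

```python
import torch
import torch.nn as nn

D, N_LAYERS = 64, 8

class ToyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(D, 3 * D)
        self.out = nn.Linear(D, D)

    def kv(self, x):
        # Project a hidden state through this layer's K/V weights only
        _, k, v = self.qkv(x).chunk(3, dim=-1)
        return k, v

    def forward(self, x, ks, vs):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        ks.append(k); vs.append(v)
        K, V = torch.cat(ks), torch.cat(vs)
        att = (q @ K.T / D**0.5).softmax(-1)  # single-token decode: no mask needed
        return x + self.out(att @ V)

layers = [ToyLayer() for _ in range(N_LAYERS)]
k_cache = [[] for _ in range(N_LAYERS)]
v_cache = [[] for _ in range(N_LAYERS)]

def confidence(h):
    # Placeholder exit signal; a real system would use e.g. the softmax
    # margin of an early LM head.
    return h.norm().item()

def decode_step(tok_emb, threshold):
    h = tok_emb
    for i, layer in enumerate(layers):
        h = layer(h, k_cache[i], v_cache[i])
        if confidence(h) > threshold and i < N_LAYERS - 1:
            # Early exit: the skipped layers still need a KV entry for this
            # position, or later full-depth tokens would hit a cache gap.
            # KV-sharing fills it from the exit-layer hidden state.
            for j in range(i + 1, N_LAYERS):
                k, v = layers[j].kv(h)
                k_cache[j].append(k); v_cache[j].append(v)
            break
    return h

out = decode_step(torch.randn(1, D), threshold=5.0)  # low threshold forces an exit
```

Directly copying the exit layer's K/V into the upper caches is the other common variant; either way, every position ends up with exactly one cache entry per layer, so subsequent full-depth tokens never encounter a gap.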