Tag
PartRep proposes a selective prompt repetition method for decoder-only LLMs that appends only the most informative tokens (selected via NLL) instead of the full prompt, reducing KV cache and prefill FLOPs while retaining most of the accuracy gains across multiple benchmarks.
This paper studies the problem of recovering input token sequences from last-layer hidden states of decoder-only language models using continuous embedding-space optimization, revealing that high-frequency function words are the main failure points while content words recover almost perfectly, achieving up to 97.5% exact-match rate.
The paper introduces SPEED, a layer-asymmetric KV visibility policy that reduces long-context inference costs by processing prompt tokens only in lower layers during prefill while maintaining full-depth attention during decoding.
River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving 1.71–2.16× speedup without quality loss.