Tag
This paper proposes ART, a lightweight run-time mechanism that tracks accumulated attention outputs during LLM decoding and terminates unnecessary KV block accesses, achieving 20% higher generation throughput with comparable accuracy.
GQLA is a minimal modification to Multi-head Latent Attention (MLA) that exposes both an MQA-absorb path and a GQA path over the same trained weights, enabling hardware-adaptive decoding without retraining. The method compresses the KV cache and supports tensor parallelism, and is demonstrated by converting LLaMA-3-8B from GQA to GQLA.
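The idea of two decoding paths over the same weights rests on a standard linear-algebra identity used in MLA-style attention: scoring a query against up-projected keys gives the same result as folding the up-projection into the query and scoring against the cached latents directly. The sketch below illustrates only that generic identity in plain Python. It is not GQLA's actual implementation, and the names `w_uk`, `scores_expanded`, and `scores_absorbed`, as well as the dimensions, are assumptions chosen for illustration.

```python
import random


def matvec(m, v):
    # Multiply matrix m (list of rows) by vector v.
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def scores_expanded(q, w_uk, latents):
    # Up-project every cached latent to a full key, then dot with the query.
    return [dot(q, matvec(w_uk, c)) for c in latents]


def scores_absorbed(q, w_uk, latents):
    # Fold the up-projection into the query once, then score against latents.
    q_latent = matvec(transpose(w_uk), q)
    return [dot(q_latent, c) for c in latents]


random.seed(0)
d_head, d_latent, n_tokens = 8, 4, 5  # hypothetical sizes

# Hypothetical key up-projection (d_head x d_latent), query, and latent cache.
w_uk = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_head)]
q = [random.gauss(0, 1) for _ in range(d_head)]
latents = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(n_tokens)]

expanded = scores_expanded(q, w_uk, latents)
absorbed = scores_absorbed(q, w_uk, latents)

# The two paths agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(expanded, absorbed))
print(max_diff)
```

Both functions read the same weight matrix and the same latent cache; they differ only in where the multiplication by `w_uk` happens. The absorbed form avoids materializing full keys for each cached token, and the two forms have different compute and memory-traffic costs, which is the kind of trade-off a hardware-adaptive choice between paths would weigh.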