llm-decoding

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

arXiv cs.CL · 2026-06-02

This paper proposes ART, a lightweight run-time mechanism that tracks accumulated attention outputs during LLM decoding and terminates unnecessary KV block accesses. The authors report 20% higher generation throughput with comparable accuracy.

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

arXiv cs.LG · 2026-05-18

GQLA is a minimal modification to Multi-head Latent Attention (MLA) that exposes both an MQA-absorb path and a GQA path over the same trained weights, enabling hardware-adaptive decoding without retraining. The method compresses the KV cache and supports tensor parallelism; the authors demonstrate it by converting LLaMA-3-8B from GQA to GQLA.
