GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

arXiv cs.LG · 2026-05-18

GQLA is a minimal modification to Multi-head Latent Attention (MLA) that exposes two decoding paths over the same trained weights: an MQA-absorb path and a GQA path. This allows hardware-adaptive decoding without retraining. The method compresses the KV cache and supports tensor parallelism; the authors demonstrate it by converting LLaMA-3-8B from GQA to GQLA.
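The idea of two decoding paths over one set of weights can be illustrated with the general absorption identity used in latent attention: attention scores come out the same whether cached latents are first up-projected to full keys, or the up-projection is folded into the query and scores are taken against the latents directly. The sketch below is a single-head, pure-Python illustration of that identity only. It is not GQLA's actual construction, and every name and size in it (`W_uk`, `head_dim`, `latent_dim`, the function names) is a hypothetical chosen for the example.

```python
import random

def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scores_expanded(q, W_uk, latents):
    # Expanded path: up-project each cached latent to a full key,
    # then take the dot product with the query.
    return [dot(q, matvec(W_uk, c)) for c in latents]

def scores_absorbed(q, W_uk, latents):
    # Absorbed path: fold W_uk into the query once, then score
    # directly against the shared latents (no keys materialized).
    q_latent = matvec(transpose(W_uk), q)
    return [dot(q_latent, c) for c in latents]

# Hypothetical toy sizes, chosen only for illustration.
random.seed(0)
head_dim, latent_dim, n_tokens = 4, 3, 5
W_uk = [[random.uniform(-1, 1) for _ in range(latent_dim)] for _ in range(head_dim)]
q = [random.uniform(-1, 1) for _ in range(head_dim)]
latents = [[random.uniform(-1, 1) for _ in range(latent_dim)] for _ in range(n_tokens)]

expanded = scores_expanded(q, W_uk, latents)
absorbed = scores_absorbed(q, W_uk, latents)
max_diff = max(abs(a - b) for a, b in zip(expanded, absorbed))
```

Both paths read the same cached latents and the same weights, so `max_diff` differs from zero only by floating-point rounding. Which path is cheaper depends on whether the hardware is limited by memory bandwidth or by compute, and that is the trade-off hardware-adaptive decoding would choose between.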
