Tag
This paper analyzes neural activation patterns across six LLM architectures on cognitive tasks, revealing differences in attention entropy and sparsity between encoder and decoder models.
This paper proposes an SRC pipeline that summarizes the KV cache through entropy-based selection and low-rank reconstruction rather than pruning tokens, reducing VRAM use for million-token LLM contexts while avoiding catastrophic attention errors.
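
The summary above names two ingredients, entropy-based selection and low-rank reconstruction, without giving their details. The following is a minimal NumPy sketch of how such a step might look for a single attention head, not the paper's actual method. Everything in it is an assumption made for illustration: the function names (`row_entropy`, `summarize_kv`, `reconstruct`), the scoring rule (attention mass weighted toward low-entropy queries), and the use of a truncated SVD as the low-rank step.

```python
import numpy as np


def row_entropy(attn, eps=1e-12):
    """Shannon entropy in nats of each row of a row-stochastic matrix."""
    p = np.clip(attn, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def summarize_kv(keys, values, attn, keep, rank):
    """Keep `keep` tokens exactly; store the rest as rank-`rank` factors.

    keys, values: (n, d) cached tensors for one head.
    attn: (q, n) attention weights from recent queries, rows sum to 1.
    """
    n = keys.shape[0]
    # Assumption: focused (low-entropy) queries get more say in the score.
    focus = np.log(n) - row_entropy(attn)
    score = focus @ attn  # one importance score per cached token
    order = np.argsort(-score)
    kept, rest = np.sort(order[:keep]), np.sort(order[keep:])

    def factor(m):
        # Truncated SVD: m is approximated by left @ right.
        u, s, vt = np.linalg.svd(m, full_matrices=False)
        return u[:, :rank] * s[:rank], vt[:rank]

    return {
        "kept": kept,
        "rest": rest,
        "keys_kept": keys[kept],
        "values_kept": values[kept],
        "keys_factors": factor(keys[rest]),
        "values_factors": factor(values[rest]),
    }


def reconstruct(summary, name):
    """Rebuild an approximate (n, d) matrix; name is "keys" or "values"."""
    left, right = summary[name + "_factors"]
    exact = summary[name + "_kept"]
    n = len(summary["kept"]) + len(summary["rest"])
    out = np.empty((n, exact.shape[1]))
    out[summary["kept"]] = exact          # selected tokens are stored as-is
    out[summary["rest"]] = left @ right   # the remainder is approximated
    return out
```

In this sketch no token is dropped: the tokens outside the selected set are stored as two factors of shapes `(n_rest, rank)` and `(rank, d)` in place of an `(n_rest, d)` matrix, so the saving comes only from choosing `rank` well below `d`.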