This paper challenges the assumption that mechanistic interpretability becomes harder as models scale, showing that attention architecture (specifically Grouped-Query Attention vs. Multi-Head Attention) matters more than parameter count for circuit localization and stability.