Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models
Summary
This paper systematically studies scale vectors in LLM normalization layers, showing that they improve optimization through a self-amplifying preconditioning effect, and proposes three lightweight improvements that enhance training performance and scaling behavior with negligible overhead.
Cached at: 05/27/26, 02:47 AM
Source: https://huggingface.co/papers/2605.26895
Abstract
Scale vectors in LLMs significantly impact optimization despite minimal parameter count, with theoretical analysis and practical improvements showing enhanced training performance and scaling behavior.
Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
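The claim that scale vectors add no expressivity in Pre-Norm architectures can be illustrated with a standard identity: when a bias-free linear map directly follows the normalization, the scale vector can be folded into the weight matrix, since W (gamma * n(x)) = (W diag(gamma)) n(x). The sketch below is not taken from the paper; it assumes an RMSNorm-style normalization, and every function name in it is hypothetical. It checks numerically that the two forms produce the same output.

```python
import math
import random

def rms_normalize(x, eps=1e-6):
    """Deterministic part of an RMSNorm-style layer (assumed here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def scaled_norm_then_linear(W, gamma, x):
    """Apply W to the normalized input scaled elementwise by gamma."""
    n = rms_normalize(x)
    return matvec(W, [g * v for g, v in zip(gamma, n)])

def fold_scale_into_weights(W, gamma):
    """Return W diag(gamma): column j of W is multiplied by gamma[j]."""
    return [[w * g for w, g in zip(row, gamma)] for row in W]

random.seed(0)
d_in, d_out = 8, 4  # illustrative sizes, not from the paper
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
gamma = [random.uniform(0.5, 1.5) for _ in range(d_in)]
x = [random.gauss(0, 1) for _ in range(d_in)]

# Form 1: explicit scale vector between normalization and linear map.
with_scale = scaled_norm_then_linear(W, gamma, x)
# Form 2: no scale vector; gamma is absorbed into the weights.
folded = matvec(fold_scale_into_weights(W, gamma), rms_normalize(x))

max_abs_diff = max(abs(a - b) for a, b in zip(with_scale, folded))
print(max_abs_diff)  # differs only by floating-point rounding
```

Both forms represent the same function, so any difference in training outcome must come from how gradient updates act on the two parameterizations. That is consistent with the abstract attributing the benefit to optimization rather than expressivity.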
Similar Articles
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
This paper identifies the 'Massive Emergence Layer' where extreme activations in LLMs originate and propagate, proposing a method to mitigate their rigidity and improve model performance on tasks like math reasoning and instruction following.
Scaling laws for neural language models
Foundational empirical study demonstrating power-law scaling relationships between language model performance and model size, dataset size, and compute budget, with implications for optimal training allocation and sample efficiency.
Scaling LLMs horizontally: hidden-state coupling without weight modification [R]
Residual Coupling (RC) connects frozen language models in parallel using lightweight learned linear bridges, enabling horizontal scaling without weight modification. It reduces perplexity by up to 80.7% compared to MoE and improves accuracy on TruthfulQA by 9.1 percentage points.
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
This paper demonstrates that cosine similarity is a poor proxy for assessing layer importance in LLMs, and proposes using the actual accuracy drop from layer removal as a more robust metric.
On the Persistent Effects of Lexicality in Large Language Mod
This paper investigates how lexical overlap, rather than semantic content, influences LLM representations across layers and architectures, and demonstrates that this lexical effect persists even in models trained for semantic similarity, leading to degraded performance on downstream tasks.