Tag
This paper introduces the Curse of Depth in LLMs, where deep layers become ineffective due to Pre-Layer Normalization causing output variance explosion. The authors propose LayerNorm Scaling to mitigate this, showing consistent improvements in pre-training and fine-tuning across model sizes up to 7B.