Tag
The article explains that attention entropy collapse in deep transformer layers is a geometric consequence of training, not a bug, and proposes a three-line temperature schedule to prevent it.