length-extrapolation

Why Do Accumulated Transformations Extrapolate?

arXiv cs.LG · 3d ago

This paper investigates why accumulated token-dependent orthogonal transformations, such as those used in PaTH Attention and a simplified variant with SO(2) rotations, enable length extrapolation in transformers. It proves that such transformations become incoherent after a finite number of steps, suppressing attention to distant tokens, and shows both theoretically and experimentally that this mechanism improves extrapolation but eventually degrades at extreme context lengths.
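The suppression effect described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the paper's construction: each token contributes an SO(2) rotation whose angle is drawn i.i.d. from a Gaussian with standard deviation `sigma`, and the query and key are the same unit vector. The function names, `sigma`, and the trial count are all hypothetical choices made for illustration. Under these assumptions the accumulated rotation over `distance` tokens is a rotation by the sum of the angles, and the expected query-key alignment is exactly `exp(-distance * sigma**2 / 2)`, so it shrinks toward zero as the distance grows.

```python
import math
import random


def mean_alignment(distance, sigma=0.5, trials=2000, seed=0):
    """Monte Carlo estimate of E[cos(phi)], where phi is the sum of
    `distance` i.i.d. N(0, sigma^2) rotation angles (a toy assumption)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Composing SO(2) rotations adds their angles.
        phi = sum(rng.gauss(0.0, sigma) for _ in range(distance))
        # For unit query and key both equal to (1, 0): q . R(phi) k = cos(phi).
        total += math.cos(phi)
    return total / trials


def closed_form(distance, sigma=0.5):
    """Exact expectation for Gaussian angles: exp(-distance * sigma^2 / 2)."""
    return math.exp(-distance * sigma ** 2 / 2.0)


if __name__ == "__main__":
    # Print distance, simulated alignment, and the closed-form value.
    for d in (1, 4, 16, 64):
        print(d, round(mean_alignment(d), 3), round(closed_form(d), 3))
```

The toy model only shows how averaging over incoherent accumulated rotations shrinks the score for distant tokens. It does not reproduce the paper's finite-step incoherence result or the degradation it reports at extreme context lengths.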
