Tag
Introduces HDD-RoPE, an extension of rotary positional embeddings that uses high-dimensional chunks and data-dependent rotation rates, showing faster convergence on TinyStories compared to xPos.
This paper provides a theoretical proof that Rotary Positional Embeddings (RoPE) in Transformer-based language models lose their locality bias and ability to distinguish token order in long contexts, with attention scores becoming no better than random. The authors show that increasing the RoPE base trades off position vs. token distinction and that multi-head, multi-layer architectures cannot compensate for this fundamental limitation.