High Dimensional, Dynamic Rotary Positional Embedding [P]

Reddit r/MachineLearning Papers

Summary

Introduces HDD-RoPE, an extension of rotary positional embeddings that uses high-dimensional chunks and data-dependent rotation rates, showing faster convergence on TinyStories compared to xPos.

At the end of my last post, I presented an idea: what if I took the core of my last project, the cumulative matrix product, and repurposed it as a positional embedding? I have just finished fleshing out the math behind HDD-RoPE and training a model with it, and the results are excellent. When trained on the TinyStories dataset, the validation loss starts converging noticeably faster than that of a baseline transformer trained with xPos.

The model is GPT-2-like, trained on TinyStories with hyperparameters copied from https://huggingface.co/roneneldan/TinyStories-33M (n_blocks=4, d_model=d_k=d_v=768). The repo at https://github.com/mikayahlevi/hdd-rope/ lets you replicate the results and goes into depth on the math and the details of the architecture.

Standard RoPE splits the queries and keys into pairs of dimensions and rotates each pair at a predefined rate. This allows the model to learn relative position by observing the change in basis between the queries and keys. Pairs make intuitive sense for a linear sequence: a two-dimensional chunk rotates with a single degree of freedom, which corresponds to a position that progresses along one dimension.

HDD-RoPE moves past this intuition and instead treats position within a sequence as multidimensional. The chunks can therefore be of any size, such as the 4 used in the TinyStories example. Four-dimensional chunks correspond to 4 choose 2 = 6 independent axes (planes) of rotation, i.e. a 6-dimensional position. Essentially, we're saying that a token doesn't just lie at a position within the sequence, but at a position within any construct the model can learn, such as a paragraph or a sentence. To facilitate this, I also make the amount of rotation along each axis data-dependent, so that the model can learn how to advance the positions based on information stored in the current layer's activations.

If you would like to learn more, please check out the repo, where I formalize the math and lay out a roadmap.
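To make the idea above concrete, here is a minimal pure-Python sketch for a single 4-dimensional chunk. It is not the code from the repo. The helper names (`generator`, `expm`, `cumulative_rotations`, `apply`), the made-up per-token rates, and the convention that token t is rotated by the cumulative product P_t = S_1 S_2 ... S_t are all assumptions made for illustration. Each step rotation S_t is the matrix exponential of a skew-symmetric matrix holding one rate per rotation plane, and the attention score between a rotated query and a rotated key then depends only on the steps that lie between the two tokens.

```python
import math
from itertools import combinations

CHUNK = 4  # chunk size used in the post's TinyStories example
PLANES = list(combinations(range(CHUNK), 2))  # 4 choose 2 = 6 rotation planes


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def generator(rates):
    """Skew-symmetric matrix with one rotation rate per plane (hypothetical helper)."""
    a = [[0.0] * CHUNK for _ in range(CHUNK)]
    for (i, j), rate in zip(PLANES, rates):
        a[i][j], a[j][i] = rate, -rate
    return a


def expm(a, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small rates)."""
    result, term = identity(len(a)), identity(len(a))
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def cumulative_rotations(per_token_rates):
    """Return P_t = S_1 S_2 ... S_t for every token, with S_t = expm(generator(rates_t))."""
    total, out = identity(CHUNK), []
    for rates in per_token_rates:
        total = matmul(total, expm(generator(rates)))
        out.append(total)
    return out


def apply(rotation, vec):
    """Rotate one chunk (a length-4 vector)."""
    return [sum(r * v for r, v in zip(row, vec)) for row in rotation]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


# Made-up per-token rates; in the post they would be computed from the
# current layer's activations rather than fixed like this.
rates = [[0.10 * (t + 1), 0.05, -0.02 * t, 0.03, 0.07 * t, -0.04] for t in range(5)]
P = cumulative_rotations(rates)

q, k = [1.0, 2.0, -1.0, 0.5], [0.3, -0.7, 1.1, 2.0]
i, j = 4, 1  # query at position i, key at earlier position j
score = dot(apply(P[i], q), apply(P[j], k))

# The same score computed from only the step rotations between j and i.
rel = identity(CHUNK)
for t in range(j + 1, i + 1):
    rel = matmul(rel, expm(generator(rates[t])))
relative_score = dot(q, apply(transpose(rel), k))
```

Here `score` and `relative_score` agree up to floating-point error, which is the relative-position property in this setting. Setting every rate to zero except those for planes (0, 1) and (2, 3), and holding them constant across tokens, reduces each step to two independent 2D rotations, i.e. standard RoPE on two pairs. A real implementation would batch this over heads and chunks in a tensor library rather than loop in Python.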

Similar Articles

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

arXiv cs.LG

This paper proposes RoVE, a parameter-free modification to Rotary Position Embeddings that makes value pathways position-sensitive by rotating values simultaneously with keys, transforming RoPE attention into attentive convolution. Experiments on GPT-2 models show consistent gains in few-shot in-context learning, out-of-distribution perplexity, and long-context retrieval.

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

arXiv cs.CL

This paper provides a theoretical proof that Rotary Positional Embeddings (RoPE) in Transformer-based language models lose their locality bias and ability to distinguish token order in long contexts, with attention scores becoming no better than random. The authors show that increasing the RoPE base trades off position vs. token distinction and that multi-head, multi-layer architectures cannot compensate for this fundamental limitation.