High Dimensional, Dynamic Rotary Positional Embedding [P]

Reddit r/MachineLearning 06/24/26, 06:16 PM Papers

positional-embedding rotary-positional-embedding transformer attention nlp tiny-stories high-dimensional

Summary

Introduces HDD-RoPE, an extension of rotary positional embeddings that uses high-dimensional chunks and data-dependent rotation rates, showing faster convergence on TinyStories compared to xPos.

At the end of my last post, I presented an idea: what if I used the core of my last project, the cumulative matrix product, and repurposed it as a positional embedding? I just finished fleshing out the math behind HDD-RoPE and training a model with this positional embedding algorithm, and the results are excellent. When trained on the dataset TinyStories, the validation loss begins to converge a fair amount faster than the baseline transformer trained using xPos. A GPT-2-like model trained on TinyStories with hyperparameters copied from https://huggingface.co/roneneldan/TinyStories-33M (n_blocks=4, d_model=d_k=d_v=768) The repo at https://github.com/mikayahlevi/hdd-rope/ allows you to replicate the results and goes in depth about the math and details of the architecture. Standard RoPE breaks the queries and keys into groups of two and rotates each pair at a predefined rate. This allows the model to learn relative position by observing the change in basis between the queries and keys. Pairs of two make intuitive sense for a linear sequence, as a chunk can be rotated with a single degree of freedom, corresponding to linear one-dimensionally progressing position. HDD-RoPE moves past this intuition and instead says that position within a sequence is multidimensional. Therefore, the chunks can be broken into any size, such as 4 as used in the TinyStories example. Four-dimensional chunks correspond to 4 choose 2 = 6 axes of rotation (6-dimensional position.) Essentially, we're saying that a token doesn't just lie at a position within the sequence, but a position within any construct the model can learn, such as a paragraph or sentence. To facilitate this, I also make the amount of rotation along each axis data-dependent, such that it can learn how to advance the positions based on information stored in the current layer's activations. If you would like to learn more, please check out the repo. I formalize the math and lay out a roadmap.

Original Article

High Dimensional, Dynamic Rotary Positional Embedding [P]

Similar Articles

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue

PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention

@ZhihuFrontier: DeepSeek-V4 RoPE Design In-Depth Analysis Key technical insights curated from Zhihu contributor kaiyuan Core Pain Point…

Submit Feedback

Similar Articles

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue

PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention

@ZhihuFrontier: DeepSeek-V4 RoPE Design In-Depth Analysis Key technical insights curated from Zhihu contributor kaiyuan Core Pain Point…