PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention
Summary
PJ-RoPE unifies RoPE's Fourier phase, Jordan-RoPE's finite jets, and ALiBi's affine recency into a single learnable relative-position space, and studies task-driven selection of sectors.
# PJ-RoPE: A Fourier–Jet–Affine Position Space for Relative Attention
Source: [https://arxiv.org/html/2606.05345](https://arxiv.org/html/2606.05345)
###### Abstract
We unify RoPE’s Fourier phase, Jordan\-RoPE’s finite jets, and ALiBi’s affine recency into a single learnable relative\-position space, and study which regions of this space are selected by different tasks\. PJ\-RoPE is a Fourier–Jet–Affine formulation for relative attention, with an optional Poincaré\-type reading as the affine completion of a homogeneous Fourier–jet positional representation\. Algebraically, the same primitives form a finite constant\-coefficient difference module: simple roots of the lag\-shift operator give Fourier/RoPE characters, repeated nonzero roots give Jordan/Fourier jets, and the repeated unit root gives ALiBi\-like affine recency\.
The framework separates scalar PJ\-bias kernels from exact PJ\-rotary feature transforms, introduces adaptive sector diagnostics, and uses LC/rapidity coordinates to stabilize high\-order jets\. Controlled probes verify sector containment and selection; small language runs expose an affine/recency boundary; music\-token streams provide the clearest case where LC/affine variants remain strong while carrying measurable high\-order corrections; and LC diagnostics show a scale\-stability gain coupled to phase\-resolution loss\.
## 1 Introduction
Relative position representations have become a central part of long\-context attention in Transformers\[[22](https://arxiv.org/html/2606.05345#bib.bib1)\]\. RoPE encodes relative lag through rotations generated by Fourier characters\[[19](https://arxiv.org/html/2606.05345#bib.bib4)\]\. ALiBi adds a monotone affine recency term directly to attention logits\[[16](https://arxiv.org/html/2606.05345#bib.bib5)\]\. Relative attention and recurrence\-based long\-context mechanisms provide earlier evidence that lag structure matters inside attention\[[18](https://arxiv.org/html/2606.05345#bib.bib2),[4](https://arxiv.org/html/2606.05345#bib.bib3)\]\. RoPE scaling methods alter how rotary phase is allocated outside the training window\[[2](https://arxiv.org/html/2606.05345#bib.bib6),[15](https://arxiv.org/html/2606.05345#bib.bib7),[5](https://arxiv.org/html/2606.05345#bib.bib8)\]\. Jordan\-RoPE takes a different step: it replaces a semisimple rotary frequency with a finite Jordan block, producing derivative\-like finite\-jet features around the same frequency\[[24](https://arxiv.org/html/2606.05345#bib.bib26)\]\.
These methods are often described as competing drop\-in designs\. That view hides a useful common structure\. RoPE is a reduced Fourier character point\. Jordan\-RoPE is a non\-reduced thickening of that point\. High\-order Jordan\-RoPE increases the jet multiplicity\. ALiBi contributes an affine, translation\-like recency direction\. This suggests a position\-space interpretation: useful attention may require both a frequency\-jet sector and an affine sector, with the task selecting how much of each to use\.
This motivates the term Poincaré\-type\. In physics, the Poincaré group is the affine extension of the homogeneous Lorentz group obtained by adjoining translations\[[23](https://arxiv.org/html/2606.05345#bib.bib25),[7](https://arxiv.org/html/2606.05345#bib.bib24)\],
$$\mathrm{ISO}(1,n-1)=\mathbb{R}^{1,n-1}\rtimes SO^{+}(1,n-1).$$
In this paper, Poincaré-type means an affine completion of a homogeneous Fourier–jet positional representation. RoPE and Jordan-RoPE live in a homogeneous phase/jet sector: they act through Fourier characters and their finite non-semisimple thickenings. ALiBi-like recency supplies the corresponding affine direction, namely an additive translation-like component in the relative-position kernel. This is a structural analogy, not a literal spacetime symmetry of token sequences.
PJ\-RoPE turns this interpretation into a concrete relative\-position framework\. It defines a Fourier–Jet–Affine space of primitives and makes the sector weights learnable\. The paper studies relative attention as the object of interest\. Its scalar implementation, PJ\-bias, is an additive attention\-logit kernel containing Fourier characters, damped finite jets, affine recency terms, and LC compactified variants\. Its exact feature\-transform implementation, PJ\-rotary, applies a relative action to query and key features and is used to verify RoPE/Jordan\-RoPE closure\. Keeping these two regimes separate is essential: scalar PJ\-bias can recover scalar kernels, but it is not the same object as a rotary feature transform\.
The framework also exposes a stability problem. High-order jets carry powers of the relative distance. At long context, these powers can cause transformed features, logits, and cache scales to grow dramatically. Light-cone PJ replaces raw distance by a compactified phase $L\operatorname{asinh}(d/L)$ and saturating amplitude $d/\sqrt{d^{2}+L^{2}}$. This controls growth, but it also compresses far-range phase resolution. We treat that compression as an explicit stability–resolution tradeoff.
PJ\-RoPE is used here as a position\-space diagnostic\. The experiments ask which region of the Fourier–Jet–Affine space is selected by a task: controlled kernels recover their designed sectors, synthetic sequence tasks test whether these sectors can be used inside trainable attention, language runs concentrate on affine/recency behavior, music\-token streams show LC/affine behavior with measurable high\-order corrections, and LC diagnostics expose the stability–resolution tradeoff\.
#### Contributions\.
We make four contributions.
1. **Position-space formulation.** We formulate PJ-RoPE as a Fourier–Jet–Affine relative-position space and separate scalar PJ-bias kernels from exact PJ-rotary feature transforms.
2. **Sector containment.** We identify RoPE, Jordan/high-order finite jets, ALiBi-like recency, and LC compactified coordinates as sectors of the same space.
3. **Adaptive diagnostics.** We introduce sector gates, effective mass, functional energy, and leave-one-order-out ablations to measure task-level sector selection.
4. **Evidence chain.** We evaluate sector recovery, trainable use, natural-task allocation, and LC stability–resolution tradeoffs across controlled kernels, synthetic sequence tasks, byte-level language, and symbolic music-token streams.
## 2 Background
Let $d=i-j\geq 0$ denote the relative lag from a query at position $i$ to a key at position $j$. RoPE represents this lag by a rotation. In complex notation, the primitive relative character is
$$\chi_{\omega}(d)=\exp(i\omega d).$$
The real implementation exposes the corresponding cosine and sine components\[[19](https://arxiv.org/html/2606.05345#bib.bib4)\]. Long-context RoPE variants such as position interpolation, YaRN, and LongRoPE change the phase schedule used beyond the training length\[[2](https://arxiv.org/html/2606.05345#bib.bib6),[15](https://arxiv.org/html/2606.05345#bib.bib7),[5](https://arxiv.org/html/2606.05345#bib.bib8)\].
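The relative-lag property of the rotary character can be checked numerically. The following is a minimal sketch, assuming a single complex feature per query and key; the helper names `rotate` and `rope_score` and the numeric values are illustrative and do not come from the paper.

```python
import cmath

def rotate(feature, omega, position):
    """Apply the rotary phase exp(i * omega * position) to one complex feature."""
    return feature * cmath.exp(1j * omega * position)

def rope_score(query, key, omega, i, j):
    """Real part of the product of the rotated query with the conjugated rotated key."""
    return (rotate(query, omega, i) * rotate(key, omega, j).conjugate()).real

# Illustrative values (assumptions, not taken from the paper).
omega = 0.25
query, key = 0.8 + 0.3j, -0.4 + 1.1j

# Two position pairs with the same lag d = i - j = 3.
score_near = rope_score(query, key, omega, i=10, j=7)
score_far = rope_score(query, key, omega, i=1010, j=1007)

# The score equals Re(q * conj(k) * chi_omega(d)) with chi_omega(d) = exp(i * omega * d).
character_score = (query * key.conjugate() * cmath.exp(1j * omega * 3)).real
```

Both scores agree with `character_score` up to floating-point error, which is the sense in which the score depends on positions only through the lag $d$.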
ALiBi uses a different primitive\. It adds a head\-specific linear recency term to the attention score\[[16](https://arxiv.org/html/2606.05345#bib.bib5)\]\. In a relative\-position basis, this is an affine direction, not a Fourier phase\. This distinction matters because a model may need local recency and oscillatory phase information for different reasons\.
Jordan-RoPE replaces a pure rotary block by a non-semisimple Jordan block\[[24](https://arxiv.org/html/2606.05345#bib.bib26)\]. A first-order Jordan correction contributes terms of the form $d\exp(i\omega d)$. Higher-order blocks contribute higher powers of $d$ multiplied by the same Fourier character. In differential language, these are finite jets of the Fourier character curve. This makes Jordan-RoPE a local thickening of RoPE in frequency space.
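To see where the factor of $d$ comes from, one can raise a $2\times 2$ Jordan block to the $d$-th power. The sketch below assumes the discrete block $J=\begin{pmatrix}z&1\\0&z\end{pmatrix}$ with $z=e^{i\omega}$, which may differ from the exact parameterization used by Jordan-RoPE; the function names are hypothetical. Its off-diagonal entry is $d\,z^{d-1}$, which equals $d\exp(i\omega d)$ up to the constant phase $e^{-i\omega}$.

```python
import cmath

def matmul2(a, b):
    """Multiply two 2x2 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def jordan_power(omega, d):
    """Return J**d for J = [[z, 1], [0, z]] with z = exp(i * omega) (assumed form)."""
    z = cmath.exp(1j * omega)
    block = [[z, 1], [0, z]]
    result = [[1, 0], [0, 1]]
    for _ in range(d):
        result = matmul2(result, block)
    return result

omega, d = 0.3, 7  # illustrative values
z = cmath.exp(1j * omega)
power = jordan_power(omega, d)

diagonal = power[0][0]            # plain Fourier character z**d
off_diagonal = power[0][1]        # first-order jet d * z**(d - 1)
first_jet = off_diagonal * z      # rescaled by z: d * exp(i * omega * d)
```

The diagonal carries the ordinary RoPE character, while the off-diagonal entry carries the polynomially modulated term that the text calls a first-order jet.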
## 3 Related Work
#### Transformers and relative position\.
The original Transformer adds absolute sinusoidal position signals to an otherwise permutation\-invariant attention layer\[[22](https://arxiv.org/html/2606.05345#bib.bib1)\]\. Relative position representations instead inject pairwise lag structure directly into self\-attention\[[18](https://arxiv.org/html/2606.05345#bib.bib2)\], while Transformer\-XL combines relative position terms with segment\-level recurrence for longer\-context language modeling\[[4](https://arxiv.org/html/2606.05345#bib.bib3)\]\. These works motivate our relative\-attention formulation of PJ\-RoPE\.
#### Rotary and affine relative position\.
RoPE represents relative lag through a homogeneous rotary phase action, making translation of positions appear as phase differences in query/key inner products\[[19](https://arxiv.org/html/2606.05345#bib.bib4)\]\. ALiBi takes the complementary scalar route: it adds a head\-specific linear recency bias directly to attention logits\[[16](https://arxiv.org/html/2606.05345#bib.bib5)\]\. The two mechanisms are often treated as competing recipes, but for this paper they supply different sectors of a relative\-position space: a homogeneous phase sector and an affine recency sector\.
#### Long\-context phase scaling and kernelized biases\.
Position interpolation, YaRN, LongRoPE, and XPos modify how rotary or relative phase is allocated outside the training window\[[2](https://arxiv.org/html/2606.05345#bib.bib6),[15](https://arxiv.org/html/2606.05345#bib.bib7),[5](https://arxiv.org/html/2606.05345#bib.bib8),[20](https://arxiv.org/html/2606.05345#bib.bib9)\]\. Kernelized and functional relative\-position methods such as KERPLE, FIRE, and MEP generalize scalar relative biases through kernel or learned\-function families\[[3](https://arxiv.org/html/2606.05345#bib.bib10),[13](https://arxiv.org/html/2606.05345#bib.bib11),[6](https://arxiv.org/html/2606.05345#bib.bib18)\]\. Hyperbolic bias methods such as HyPE also use nonlinear distance coordinates for relative\-position bias\[[1](https://arxiv.org/html/2606.05345#bib.bib19)\]\. These methods are strong baselines for long\-context language modeling and remain important reference lines\. PJ\-RoPE asks which primitive a task selects when Fourier, finite\-jet, affine, and LC\-stabilized coordinates are available in a shared space\.
#### Relative\-position primitive families\.
Existing relative\-position mechanisms can also be grouped by the primitive function of the lag that they introduce into attention\. Table\-based and bucketed methods learn local or discretized offsets\. Bias\-centric methods add scalar recency functions to the logits, including linear, bucketed, logarithmic, or kernelized forms\. Rotary methods use Fourier phase characters and modify their frequency schedules for long\-context extrapolation\. Jordan\-type methods add polynomially modulated Fourier terms\. PJ\-RoPE follows this primitive\-family view: the Fourier sector contains order\-zero phase characters, the finite\-jet sector contains repeated\-root Fourier jets, the affine sector contains ALiBi\-like recency, and the LC branch compactifies high\-order coordinates for long\-context stability\.
#### NoPE and length\-generalization analysis\.
No\-position\-encoding studies show that causal Transformers can still acquire positional information from the causal mask and training dynamics, and broader comparisons find that the best length\-generalization behavior depends strongly on task and position mechanism\[[8](https://arxiv.org/html/2606.05345#bib.bib12),[11](https://arxiv.org/html/2606.05345#bib.bib13)\]\. This supports the sector\-selection framing: the empirical question is which region of the position space is occupied under a given training regime and domain\.
#### Algebraic and group\-theoretic positional encodings\.
Several recent works study positional encoding through algebraic or group\-action lenses\. Algebraic positional encodings interpret positions as structured operators\[[12](https://arxiv.org/html/2606.05345#bib.bib14)\]; LieRE generalizes RoPE through learned Lie rotations\[[14](https://arxiv.org/html/2606.05345#bib.bib15)\]; and GRAPE provides the closest existing group\-action unification framework for RoPE\-like multiplicative rotations and ALiBi\-like additive biases\[[25](https://arxiv.org/html/2606.05345#bib.bib16)\]\. Within that landscape, PJ\-RoPE emphasizes the non\-semisimple Fourier–jet sector, where a rotary frequency is replaced by a defective complex block, together with adaptive sector diagnostics and LC/rapidity stabilization\.
#### From Jordan\-RoPE to PJ\-RoPE\.
Jordan\-RoPE extends RoPE by replacing a semisimple rotary frequency with a finite Jordan block, yielding non\-semisimple finite\-jet corrections around a Fourier point\[[24](https://arxiv.org/html/2606.05345#bib.bib26)\]\. PJ\-RoPE keeps that sector but broadens the object of study: it adds affine recency as an explicit completion, exposes learnable gates over sectors and orders, and introduces LC/rapidity coordinates to stabilize high\-order behavior at long range\. The upgrade is therefore from a single non\-semisimple rotary sector to a learnable Fourier–Jet–Affine relative\-position space\.
#### Music sequence modeling\.
Music Transformer showed that self\-attention and relative timing are natural tools for symbolic music with long\-range repetition\[[10](https://arxiv.org/html/2606.05345#bib.bib21)\]\. MAESTRO provides aligned piano performance data with MIDI and audio\[[9](https://arxiv.org/html/2606.05345#bib.bib22)\], while MusicNet provides classical music recordings with note annotations designed for transcription research\[[21](https://arxiv.org/html/2606.05345#bib.bib23)\]\. Our use of MAESTRO and MusicNet is restricted to symbolic MIDI\-derived token streams; it is not an audio transcription benchmark\.
## 4 PJ-RoPE Position Space
PJ-RoPE studies a finite family of relative-position primitives indexed by lag. At the scalar-kernel level, these primitives are functions $K(d)$ that can be added to attention logits. At the feature-transform level, they arise from a one-parameter relative action $G(d)$ applied to query and key features. The paper uses both views, but keeps their claims separate.
The Fourier sector contains characters such as $\cos(\omega d)$ and $\sin(\omega d)$. The finite-jet sector thickens a Fourier point with terms such as
$$(d/L)^{r}\exp(-cd/L)\cos(\omega d),$$
and the corresponding sine components. The affine sector contains constants and linear recency terms such as $-sd/L$. The LC branch is a compactified coordinate chart for high-order Fourier–jet behavior.
One algebraic way to view the same object is through constant-coefficient difference modules. Finite solutions of equations of the form $P(E)K=0$, with $P$ a polynomial in the lag-shift operator $E$, are spanned by functions resembling
$$\binom{d}{r}z^{d}.$$
The semisimple case $z=\exp(i\omega)$, $r=0$ gives the Fourier/RoPE sector. The non-semisimple cases with $r>0$ give finite jets. The point $z=1$ with linear terms gives the affine/recency sector. PJ-RoPE packages these regions into a learnable relative-position space with task-selected sector weights. Appendix [B](https://arxiv.org/html/2606.05345#A2) develops this viewpoint formally: simple roots give RoPE-like Fourier characters, repeated nonzero roots give Jordan/Fourier jets, and the repeated unit root gives the ALiBi-like affine direction.
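The difference-module reading can be tested directly: a primitive $\binom{d}{r}z^{d}$ should be annihilated by $(E-z)^{r+1}$ but not by $(E-z)^{r}$. The following sketch assumes the shift convention $(EK)(d)=K(d+1)$ and uses hypothetical helper names; it illustrates the algebra and is not code from the paper.

```python
import cmath
from math import comb

def apply_shift_minus_root(sequence, root):
    """Apply (E - root), with (E K)(d) = K(d + 1); the output is one entry shorter."""
    return [sequence[d + 1] - root * sequence[d] for d in range(len(sequence) - 1)]

def primitive(order, root, length):
    """Tabulate binom(d, order) * root**d for d = 0 .. length - 1."""
    return [comb(d, order) * root ** d for d in range(length)]

def residual(order, root, applications, length=24):
    """Largest absolute entry left after applying (E - root) repeatedly."""
    sequence = primitive(order, root, length)
    for _ in range(applications):
        sequence = apply_shift_minus_root(sequence, root)
    return max(abs(value) for value in sequence)

fourier_root = cmath.exp(0.7j)  # illustrative frequency omega = 0.7

# A second-order jet at a nonzero root needs three applications to vanish.
jet_killed = residual(order=2, root=fourier_root, applications=3)
jet_survives = residual(order=2, root=fourier_root, applications=2)

# The affine direction d is the order-1 primitive at the unit root z = 1.
affine_killed = residual(order=1, root=1.0, applications=2)
```

Each application of $(E-z)$ lowers the binomial order by one, since $\binom{d+1}{r}-\binom{d}{r}=\binom{d}{r-1}$. After two applications the order-2 primitive has been reduced to a pure character of unit magnitude, and the third application removes it.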
### 4.1 Poincaré-type affine completion
The Poincaré\-type terminology refers to an affine\-completion pattern\. Let
$$\mathcal{H}_{\mathrm{FJ}}=\operatorname{span}\left\{(d/L)^{r}e^{-cd/L}e^{i\omega d}\right\}_{\omega,c,r}$$
denote the homogeneous Fourier–jet sector. This sector contains RoPE at $r=0$ and Jordan-RoPE at $r>0$. The affine recency sector is
$$\mathcal{A}_{\mathrm{rec}}=\operatorname{span}\{1,-d/L\}.$$
PJ-RoPE forms the finite relative-position module
$$\mathcal{P}_{\mathrm{PJ}}=\mathcal{H}_{\mathrm{FJ}}\oplus\mathcal{A}_{\mathrm{rec}}\oplus\mathcal{H}_{\mathrm{LC}},$$
where $\mathcal{H}_{\mathrm{LC}}$ is a light-cone compactified coordinate chart for the high-order sector.
This is analogous to passing from a homogeneous group to an affine group\. The homogeneous part supplies phase and finite\-jet transformations; the affine part supplies translation\-like additive recency\. In this sense, RoPE/Jordan\-RoPE correspond to the homogeneous phase representation, while PJ\-RoPE is its Poincaré\-type affine extension\. At the scalar\-kernel level this affine completion is implemented as a finite direct\-sum module with learned gates, not as a literal semidirect\-product group action\.
###### Definition 1 (PJ-bias kernel).
Fix a head $h$, training scale $L$, lag $d\geq 0$, frequencies $\omega_{\ell h}$, damping rates $c_{\ell h}\geq 0$, and maximum jet order $R$. A scalar PJ-bias kernel is a learned additive attention term
$$K_{h}(d)=g_{\mathrm{FJ},h}K_{\mathrm{FJ},h}(d)+g_{\mathrm{aff},h}K_{\mathrm{aff},h}(d)+g_{\mathrm{LC},h}K_{\mathrm{LC},h}(d),$$
where
$$K_{\mathrm{FJ},h}(d)=\sum_{\ell,r}\left(\frac{d}{L}\right)^{r}e^{-c_{\ell h}d/L}\left[a_{\ell rh}^{c}\cos(\omega_{\ell h}d)+a_{\ell rh}^{s}\sin(\omega_{\ell h}d)\right],$$
$$K_{\mathrm{aff},h}(d)=b_{0,h}-s_{h}d/L.$$
The sector weights are normalized gates
$$(g_{\mathrm{FJ},h},g_{\mathrm{aff},h},g_{\mathrm{LC},h})=\operatorname{softmax}(\eta_{h}),$$
so the three branches form a learned allocation over Fourier–jet, affine, and LC coordinates. The LC branch is the compactified chart obtained by replacing raw phase and amplitude in the same finite family by $\phi_{L}(d)=L\operatorname{asinh}(d/L)$ and $\beta_{L}(d)=d/\sqrt{d^{2}+L^{2}}$, for example
$$\beta_{L}(d)^{r}e^{-c_{\ell h}\phi_{L}(d)/L}\cos(\omega_{\ell h}\phi_{L}(d)).$$
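Definition 1 can be sketched as a small scalar function for a single head. This is a minimal illustration under stated assumptions: each high-order term is passed as a tuple `(omega, damping, order, a_cos, a_sin)`, and the LC branch is given its own coefficient list, which the definition does not specify. All names and parameter values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def jet_sum(terms, amplitude, phase, scale):
    """Sum amplitude**r * exp(-c * phase / scale) * (a_c cos(w phase) + a_s sin(w phase))."""
    total = 0.0
    for omega, damping, order, a_cos, a_sin in terms:
        envelope = amplitude ** order * math.exp(-damping * phase / scale)
        total += envelope * (a_cos * math.cos(omega * phase)
                             + a_sin * math.sin(omega * phase))
    return total

def pj_bias(d, scale, eta, fj_terms, lc_terms, b0, slope):
    """Scalar PJ-bias K_h(d) for one head; eta holds (FJ, affine, LC) gate logits."""
    g_fj, g_aff, g_lc = softmax(eta)
    k_fj = jet_sum(fj_terms, d / scale, d, scale)   # raw amplitude d/L, raw phase d
    k_aff = b0 - slope * d / scale                  # affine recency
    phi = scale * math.asinh(d / scale)             # compactified phase
    beta = d / math.sqrt(d * d + scale * scale)     # saturating amplitude
    k_lc = jet_sum(lc_terms, beta, phi, scale)
    return g_fj * k_fj + g_aff * k_aff + g_lc * k_lc

# Hypothetical parameters: one first-order jet term shared by both high-order branches.
jet_terms = [(0.1, 0.5, 1, 1.0, 0.0)]
affine_only = pj_bias(64, 128, [-50.0, 50.0, -50.0], jet_terms, jet_terms,
                      b0=0.0, slope=1.0)
fourier_only = pj_bias(10, 128, [50.0, -50.0, -50.0],
                       [(0.1, 0.0, 0, 1.0, 0.0)], [], b0=0.0, slope=0.0)
```

Pushing the gate logits toward one branch recovers that branch alone: `affine_only` approaches the recency value $-d/L=-0.5$, and `fourier_only` approaches the undamped order-zero character $\cos(\omega d)=\cos(1)$.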
###### Proposition 1 (Sector containment).
Under parameter restrictions, the PJ position space contains the standard relative-position primitives used in this paper. The $r=0$ Fourier sector is the scalar character sector associated with RoPE. Terms with $r>0$ at fixed $\omega$ are finite-jet corrections corresponding to Jordan-RoPE and high-order Jordan-RoPE. The $K_{\mathrm{aff}}$ branch contains ALiBi-like affine recency. The LC branch contains the compactified high-order variants used for stabilization. At the feature-transform level, exact PJ-rotary recovers RoPE and Jordan-RoPE by using the corresponding semisimple or finite-Jordan generator.
###### Proof sketch\.
Setting $g_{\mathrm{aff}}=g_{\mathrm{LC}}=0$ and $r=0$ leaves ordinary Fourier characters. Keeping a fixed frequency and allowing $r>0$ gives the finite derivatives of that character curve, equivalently the polynomial coordinates generated by a finite Jordan block. Setting $g_{\mathrm{FJ}}=g_{\mathrm{LC}}=0$ gives the affine bias. Because the gates are softmax outputs, these zero-gate settings are reached as limits of the gate logits $\eta_{h}$. The LC sector is a coordinate substitution applied to the same finite basis, so it is a compactified subfamily; it does not assert a separate exact rotary action. ∎
Figure 1: PJ-RoPE position space. The framework organizes homogeneous Fourier–jet coordinates, affine recency, and LC-stabilized high-order coordinates into a shared relative-position space with adaptive sector diagnostics. The LC branch is a compactified coordinate chart for high-order Fourier–jet behavior, not an exact rotary group action.
## 5 Implementation Regimes
The theory permits several implementations\. The current experiments separate four regimes because each supports a different claim\.
Table 1: Implementation regimes. The table separates scalar additive kernels from exact feature-transform claims.
This taxonomy separates the claims supported by each implementation. PJ-bias can recover scalar Fourier, jet, affine, and LC kernels, but it is not itself the RoPE feature transform. Exact PJ-rotary is the object used for feature-level representation claims. LC-PJ bias is a stabilized scalar path: it is designed to control long-context growth, not to provide an exact rotary group action.
## 6 Adaptive PJ
Adaptive PJ turns the position space into a measurable selection problem\. Each head receives sector gates over the Fourier/jet, affine, and light\-cone branches\. Within the FJ or LC branch, an order spectrum allocates mass across jet orders\. The diagnostics determine whether the learned position primitive moves toward the sector implied by the teacher or task\.
The three diagnostics answer complementary questions\. Effective mass asks where gated parameter magnitude is allocated\. Functional energy asks which orders contribute to the realized kernel over the evaluation window\. Leave\-one\-order\-out asks whether removing an order changes the fitted function\.
Raw parameters are not enough for this diagnostic\. Different basis functions can have different scales and supports, so the paper reports both a parameter\-side quantity,*effective mass*, and a function\-side quantity,*functional energy*\. Let
$$g_{h}=\operatorname{softmax}(\eta_{h})=(g_{h,\mathrm{FJ}},g_{h,\mathrm{aff}},g_{h,\mathrm{LC}})$$
be the sector gates for head $h$. Within the FJ and LC sectors, let $\alpha^{B}_{h,r}$ be the conditional order spectrum for branch $B\in\{\mathrm{FJ},\mathrm{LC}\}$, and let
$$|\zeta^{B}_{h,r}|=\sqrt{(\zeta^{B,c}_{h,r})^{2}+(\zeta^{B,s}_{h,r})^{2}}$$
denote the magnitude of the signed cosine/sine amplitude pair at order $r$. The branch-specific effective mass is
$$M^{B}_{h,r}=g_{h,B}\,\alpha^{B}_{h,r}\,|\zeta^{B}_{h,r}|,\qquad B\in\{\mathrm{FJ},\mathrm{LC}\}.$$
The order-level effective mass used in the diagnostic plots is
$$\bar{M}_{h,r}=\frac{M^{\mathrm{FJ}}_{h,r}+M^{\mathrm{LC}}_{h,r}}{\sum_{j=0}^{R}\left(M^{\mathrm{FJ}}_{h,j}+M^{\mathrm{LC}}_{h,j}\right)+\epsilon}.$$
This is a parameter-side answer to the question: where did the model allocate its gated high-order amplitude? The affine branch has no jet order, so affine selection is reported through $g_{h,\mathrm{aff}}$ and the learned slope.
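The effective-mass computation is short enough to write out. The sketch below follows the formulas above for one head, assuming per-branch dictionaries of gates, order spectra, and cosine/sine amplitudes; the container layout and the numbers are hypothetical.

```python
import math

def effective_mass(gates, alpha, zeta_cos, zeta_sin, eps=1e-12):
    """Normalized order-level effective mass over the FJ and LC branches.

    gates maps branch -> gate value; alpha, zeta_cos, zeta_sin map
    branch -> list indexed by jet order r = 0 .. R.
    """
    orders = len(alpha["FJ"])
    raw = []
    for r in range(orders):
        mass = 0.0
        for branch in ("FJ", "LC"):
            magnitude = math.hypot(zeta_cos[branch][r], zeta_sin[branch][r])
            mass += gates[branch] * alpha[branch][r] * magnitude
        raw.append(mass)
    total = sum(raw) + eps
    return [m / total for m in raw]

# Hypothetical head with R = 1; the remaining 0.3 of gate mass sits on the affine branch.
gates = {"FJ": 0.6, "LC": 0.1}
alpha = {"FJ": [0.5, 0.5], "LC": [1.0, 0.0]}
zeta_cos = {"FJ": [3.0, 0.0], "LC": [0.0, 0.0]}
zeta_sin = {"FJ": [4.0, 2.0], "LC": [1.0, 0.0]}

mass = effective_mass(gates, alpha, zeta_cos, zeta_sin)
```

In this example the unnormalized masses are $1.6$ at order 0 and $0.6$ at order 1, so the normalized spectrum is roughly $(0.73, 0.27)$. The affine gate does not enter, matching the remark that affine selection is reported separately.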
Functional energy measures realized-kernel scale over a lag window. Write the order-$r$ realized component as
$$C_{h,r}(d)=g_{h,\mathrm{FJ}}C^{\mathrm{FJ}}_{h,r}(d)+g_{h,\mathrm{LC}}C^{\mathrm{LC}}_{h,r}(d),$$
where each $C^{B}_{h,r}$ includes the corresponding basis function, order weight $\alpha^{B}_{h,r}$, amplitude $\zeta^{B}_{h,r}$, damping, and LC coordinate substitution if applicable. For a window $W$ with weights $w_{d}$, define
$$\|f\|_{2,W}=\left(\sum_{d\in W}w_{d}f(d)^{2}\right)^{1/2}.$$
The functional energy ratio is
$$E_{h,r}(W)=\frac{\|C_{h,r}\|_{2,W}}{\left\|\sum_{j=0}^{R}C_{h,j}\right\|_{2,W}+\epsilon}.$$
Unlike $\bar{M}_{h,r}$, this quantity accounts for finite-window scale, damping, LC compactification, phase cancellation, and extrapolation length. It is therefore not forced to sum to one when order components are not orthogonal; it is a diagnostic of realized functional scale.
Finally, leave\-one\-order\-out asks whether the order is needed for the fitted function:
$$\Delta^{\mathrm{LOO}}_{h,r}(W)=\operatorname{MSE}_{W}\!\left(K_{h}-C_{h,r},y\right)-\operatorname{MSE}_{W}\!\left(K_{h},y\right).$$
A useful order should have coherent evidence across these views: nontrivial effective mass, nontrivial functional energy on the evaluation window, and a positive leave-one-order-out delta. This separates parameter selection from actual function-level contribution.
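The two function-side diagnostics can be sketched on tabulated components. The example assumes each order component is already evaluated on the lags of the window, that the window MSE is an unweighted mean, and that the kernel is the sum of the order components with no affine part; these are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import math

def window_norm(values, weights):
    """Weighted L2 norm over the lag window."""
    return math.sqrt(sum(w * v * v for v, w in zip(values, weights)))

def functional_energy(components, weights, eps=1e-12):
    """components[r][k] is the order-r component at the k-th lag of the window."""
    total = [sum(column) for column in zip(*components)]
    denominator = window_norm(total, weights) + eps
    return [window_norm(c, weights) / denominator for c in components]

def mse(prediction, target):
    """Unweighted mean squared error over the window (an assumption)."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)

def leave_one_order_out(kernel, components, target):
    """MSE increase caused by removing each order component from the kernel."""
    base = mse(kernel, target)
    deltas = []
    for component in components:
        ablated = [k - c for k, c in zip(kernel, component)]
        deltas.append(mse(ablated, target) - base)
    return deltas

# Two partially cancelling order components on a four-lag window.
components = [[1.0, 1.0, 1.0, 1.0], [-0.5, -0.5, -0.5, -0.5]]
weights = [1.0, 1.0, 1.0, 1.0]
kernel = [sum(column) for column in zip(*components)]

energy = functional_energy(components, weights)
deltas = leave_one_order_out(kernel, components, target=kernel)
```

Here the energy ratios are about $2$ and $1$, so they sum to $3$ rather than $1$: the components cancel, which is the non-orthogonal case described above. Both leave-one-order-out deltas are positive ($1.0$ and $0.25$), indicating that each order is needed to fit the target.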
## 7 Light-cone PJ
Raw high\-order jets contain powers of distance\. At long context, those powers can make transformed features, logits, and cache scales grow rapidly\. LC\-PJ replaces the raw coordinate with a compactified phase and a saturating amplitude:
$$\phi_{L}(d)=L\operatorname{asinh}(d/L),\qquad\beta_{L}(d)=\frac{d}{\sqrt{d^{2}+L^{2}}}.$$
The LC coordinate has a rapidity form. Writing
$$d/L=\sinh\eta,$$
we obtain
$$\phi_{L}(d)=L\eta,\qquad\beta_{L}(d)=\tanh\eta.$$
Thus LC-PJ replaces raw distance by a rapidity-like phase coordinate and a velocity-like bounded amplitude. High-order jet powers are applied to $\beta_{L}(d)$, so they remain bounded at large lag:
$$|\beta_{L}(d)|\leq 1.$$
The price is that the rapidity coordinate grows only logarithmically in the far field, and
$$\partial_{d}\phi_{L}(d)=\frac{1}{\sqrt{1+(d/L)^{2}}},$$
so phase resolution is compressed at long range. A wrong-scale controlled contrast using $\operatorname{asinh}(d/L)$ without the outer $L$ is stable in norm but collapses usable phase resolution. The central LC claim is therefore a measurable stability–resolution tradeoff.
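The rapidity identities and the tradeoff can be checked numerically. With $d/L=\sinh\eta$ we have $\sqrt{d^{2}+L^{2}}=L\cosh\eta$, hence $\beta_{L}(d)=\sinh\eta/\cosh\eta=\tanh\eta$. The sketch below assumes a scale of 512 and a lag of 32768 purely for illustration; the function names are hypothetical.

```python
import math

def lc_phase(d, scale):
    """Compactified phase phi_L(d) = L * asinh(d / L)."""
    return scale * math.asinh(d / scale)

def lc_amplitude(d, scale):
    """Saturating amplitude beta_L(d) = d / sqrt(d^2 + L^2)."""
    return d / math.sqrt(d * d + scale * scale)

def lc_phase_slope(d, scale):
    """Analytic derivative of phi_L with respect to d."""
    return 1.0 / math.sqrt(1.0 + (d / scale) ** 2)

scale, far = 512, 32768  # illustrative training scale and far-field lag (assumptions)

rapidity = math.asinh(far / scale)
amplitude = lc_amplitude(far, scale)

# Stability: a raw third-order jet grows as (d/L)^3, the LC jet stays below 1.
raw_cubic_jet = (far / scale) ** 3
lc_cubic_jet = amplitude ** 3

# Resolution: the phase advances much more slowly per token at long range.
analytic_slope = lc_phase_slope(far, scale)
numeric_slope = lc_phase(far + 0.5, scale) - lc_phase(far - 0.5, scale)
```

At this lag the raw cubic jet is $64^{3}=262144$ while the LC jet stays below one, and the phase slope falls to about $0.016$ of its near-field value of one. That pairing is the stability–resolution exchange the section describes.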
[Figure 2 schematic, "From Fourier Character Points to Jordan Jet Thickenings and Affine Recency". Panel 1: RoPE, a Fourier character point $\chi_{\omega}(d)=e^{i\omega d}$ in spectral/frequency space (order-0, semisimple). Panel 2: Jordan-RoPE, a finite jet thickening with jets $d\,e^{i\omega d}$, $d^{2}e^{i\omega d}$, ..., $d^{k}e^{i\omega d}$ (non-semisimple; local nilpotent directions thicken the character point). Panel 3: ALiBi / affine recency sector spanned by $1$ and $-d/L$, a translation-like affine direction separate from the Fourier curve.]
Figure 2: Root-level difference-module schematic. RoPE is a simple Fourier root, Jordan-RoPE is a repeated nonzero root that generates finite Fourier jets, and ALiBi supplies the repeated unit-root affine recency direction.
## 8 Summary of Questions and Observations
Table [2](https://arxiv.org/html/2606.05345#S8.T2) summarizes the experimental questions, observations, and readings used in the rest of the paper.
Table 2: Summary of questions, observations, and readings.
Detailed reproducibility notes are placed outside the main narrative.
## 9 Experiments
The experiments describe how different tasks occupy the PJ position space\. First, fixed kernels test scalar containment\. Second, adaptive diagnostics turn controlled teachers into sector\-allocation measurements\. Third, synthetic sequence tasks test whether the sectors can be used inside trainable attention\. Fourth, byte\-level language and music\-token runs show how natural streams allocate mass under the small\-model training regimes used here\. Finally, LC diagnostics measure the stability–resolution exchange introduced by the compactified chart\.
The experiments were run in separate diagnostic suites\. We present them by their role in the argument, and a separate reproducibility bundle records the exact source files\.
#### Experimental setup\.
Unless otherwise noted, trainable sequence experiments use small causal Transformers trained at short context and evaluated by validation cross-entropy at longer contexts. The default trainable model has two layers, embedding dimension 96, four attention heads, and MLP ratio 2, giving roughly 0.20M parameters for byte-level language runs and 0.15M parameters for synthetic query classifiers. The language boundary tests use byte-level tokenization, train at context length 1024, and report three-seed sampled 32768-byte-token stress evaluations on Tiny Shakespeare, WikiText-2, and a Project Gutenberg War and Peace corpus. A 32768-token byte-level window is about 32 KB of raw text; these rows are tokenizer-level extrapolation stress tests, not document-scale semantic context evaluations. The music-token tests use symbolic MIDI-derived byte streams, not audio labels; the 32768-token MAESTRO and MusicNet summaries train at context length 512 and report three model seeds with one long-context evaluation batch per seed. MAESTRO controls use a random 128-file train-split MIDI token stream with controller/pedal events. MusicNet selector A and selector B are two random 64-piece reference-MIDI program-token streams, sampled with seeds 29 and 37 from the MusicNet reference-MIDI archive. Synthetic query-LM bridges freeze Q/K content scores in the main setting so that the positional branch must carry the teacher signal.
### 9.1 Fixed-kernel diagnostics
The first diagnostic isolates expressivity at the scalar\-kernel level\. We fit fixed PJ bases to target kernels representing pure phase, first/second/third jets, affine recency, mixed recency\-plus\-weak\-jet behavior, and LC targets\. A successful result includes both low loss and sector match: the selected basis should match the intended sector\.
The recovery summary shows that phase targets are recovered by RoPE-like bases, jet targets by the corresponding finite-jet bases, affine targets by affine bases, and LC targets by LC bases. This is the scalar containment picture at the fixed-kernel level. Table [3](https://arxiv.org/html/2606.05345#S9.T3) gives the numerical sector summary, and Figure [3](https://arxiv.org/html/2606.05345#S9.F3) gives the visual summary.
### 9.2 Adaptive sector recovery
The fixed\-basis experiment asks whether the space is expressive\. The adaptive experiment asks whether a learned PJ kernel can find the right region of that space\. Controlled teachers are chosen so that the expected sector is known: Fourier teachers should select order\-zero FJ mass, higher jet teachers should shift energy to higher FJ orders, affine teachers should activate the recency branch, and LC teachers should select the light\-cone branch\.
Adaptive PJ turns the position space into an observable allocation problem. We report sector gates, effective mass, functional order energy, and leave-one-order-out deltas side by side because damping, window length, LC compactification, and phase cancellation can change the realized functional contribution of a learned coefficient. The main diagnostics are shown in Figure [3](https://arxiv.org/html/2606.05345#S9.F3). Adaptive PJ moves toward the teacher sector when the teacher is known.
Table 3:Fixed\-kernel and adaptive sector recovery summary\. Gates are reported as FJ/A/LC\. The top order is the dominant functional\-energy order inside the selected high\-order branch; — indicates that the selected sector has no jet order\.


Figure 3:Fixed\-kernel and adaptive sector recovery\. Top left: fixed\-basis recovery error\. Top right: adaptive functional order energy\. Bottom: adaptive sector gates\.
### 9\.3Synthetic sequence bridge
Static kernel recovery leaves open whether a Transformer can use the same structure\. The synthetic sequence bridge therefore places PJ sectors inside a small trainable causal\-attention model\. The tasks are controlled query\-LM problems with signed jet teachers, affine attention teachers, and LC\-core teachers\. In the main bridge setting, Q/K content scores are frozen so that the positional branch must carry the relevant signal\.
The evidence shows that affine teachers are solved by the affine sector, signed first/second\-jet teachers are solved by FJ variants under multi\-length training, and LC\-core teachers activate the LC path\. These synthetic tasks isolate the trainable\-attention step: the positional branch carries the teacher signal, and the learned gates show whether the corresponding sector is used\.



Figure 4:Synthetic sequence bridge\. Top: signed jet teacher accuracy under multi\-length training\. Bottom left: learned sector gates\. Bottom right: LC\-core teacher accuracy\.
### 9\.4Natural\-language boundary test
In the byte\-level language runs, the dominant behavior is affine/recency selection\. We train small byte\-level language models at short context and evaluate them at long context, including 32768\-token stress settings\. NTK\-style RoPE scaling plus affine recency gives the lowest 32768\-token loss on Tiny Shakespeare, WikiText\-2, and War and Peace\. High\-order FJ/LC mass appears mainly as a diagnostic correction; affine recency is the leading source of loss improvement in this slice\.
### 9\.5Music\-token transfer
Music\-token streams allocate differently from byte\-level language\. On MAESTRO MIDI controls/pedal streams and two MusicNet reference\-MIDI selectors, LC\-affine variants remain competitive or best at 32768 tokens, and the learned high\-order mass is small but consistently nonzero\. This pattern is consistent with motif returns, rhythm envelopes, phrase timing, and long\-range repetition\. The exact PJ\-rotary baseline provides a feature\-transform closure control: it fits near the training scale but degrades sharply at 32768 tokens, separating representation correctness from stable long\-context usefulness\. Table[4](https://arxiv.org/html/2606.05345#S9.T4)reports the main 32768\-token contrasts\.
Table 4: Natural task summaries at 32768 tokens. Panel (a) reports the language allocation; Panel (b) reports music-token allocation. Entries are mean $\pm$ standard deviation over three model seeds; lower validation cross-entropy is better.
(a) Language boundary. Models are trained at context length 1024. The second-best column reports the next-best scaled-RoPE-plus-affine variant.
(b) Music-token allocation. Models are trained at context length 512. The high-order mass column reports the total high-order mass of the LC-affine row.




Figure 5:Natural task contrast\. Language concentrates on recency/affine behavior, while MusicNet/MAESTRO MIDI\-token streams keep LC/affine variants strong with measurable high\-order corrections\. The exact PJ\-rotary panel is a feature\-transform closure control: feature\-transform correctness does not by itself guarantee stable extrapolation\.
### 9\.6GRAPE special\-case reruns
GRAPE is the closest existing group\-action framework to PJ\-RoPE\. We therefore include three exact special\-case controls: GRAPE\-M/RoPE, GRAPE\-A/ALiBi, and GRAPE\-M\+A/RoPE\+ALiBi\. These controls provide the closest standard multiplicative\-rotation and additive\-recency reference point\.
The fixed-projection comparison in Appendix [A](https://arxiv.org/html/2606.05345#A1) isolates primitive containment. GRAPE-M+A covers separate phase and affine directions, while PJ-FJ contains phase-modulated distance terms such as $(d/L)\cos\omega d$ and $(d/L)^{2}\cos\omega d$. The trainable reruns then show small-model finite-budget behavior: GRAPE-M+A is a strong natural-task control, PJ and LC-PJ variants are competitive in some rows, and the best method depends on domain and training regime. This separates primitive containment from trainable optimization behavior.
### 9\.7Light\-cone stabilization
The LC experiments test the stabilizer directly\. Raw high\-order coordinates can create extreme transformed\-key norms and logit scales at 32768 tokens\. LC variants replace raw distance with compactified phase and saturating amplitude, so the relevant measurements are not only loss but also Q/K proxy, effective support, phase span, cache/logit scale, quantization error, and retrieval resolution\.
The observed pattern matches the theoretical expectation. LC variants bound coordinate and cache scale, while the same compression reduces far-range phase span and hard-negative retrieval resolution, as visible in the int4 and hard-negative retrieval probes. Table [5](https://arxiv.org/html/2606.05345#S9.T5) gives the core stability and resolution metrics.
For clarity, the LC table uses the same stability proxy as Figure [6](https://arxiv.org/html/2606.05345#S9.F6). “QK proxy” is the final-lag query/key norm proxy from the feature-scale sweep, and “cache logit std” is the standard deviation of the cached logit proxy at the evaluation length. Let $W_{\mathrm{far}}$ denote the far evaluation bucket and let $\psi(d)$ be the phase coordinate used by a variant. We define
$$\mathrm{phase\ ratio}=\frac{\max_{d\in W_{\mathrm{far}}}\psi(d)-\min_{d\in W_{\mathrm{far}}}\psi(d)}{\max_{d\in W_{\mathrm{far}}}d-\min_{d\in W_{\mathrm{far}}}d},$$
normalized so that raw and scaled coordinates equal $1$. The far span is the corresponding frequency-weighted phase span,
$$\max_{d\in W_{\mathrm{far}}}\omega\psi(d)-\min_{d\in W_{\mathrm{far}}}\omega\psi(d).$$
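The two quantities defined here can be sketched in a few lines. This is a minimal illustration, assuming a raw coordinate $\psi(d)=d$ and the asinh coordinate $L\operatorname{asinh}(d/L)$ from Appendix B.6 as the LC phase coordinate; the far bucket, the scale `L`, the frequency `omega`, and the helper names are hypothetical choices, and the extra normalization for scaled coordinates is not reproduced.

```python
import math

def phase_ratio(psi, far_bucket):
    """Span of psi over the far bucket divided by the span of the raw lag d."""
    values = [psi(d) for d in far_bucket]
    return (max(values) - min(values)) / (max(far_bucket) - min(far_bucket))

def far_span(psi, far_bucket, omega):
    """Frequency-weighted phase span: max(omega * psi) - min(omega * psi)."""
    values = [omega * psi(d) for d in far_bucket]
    return max(values) - min(values)

# Hypothetical settings, not taken from the paper's experiments.
L = 512.0                          # assumed compactification scale
omega = 0.05                       # assumed single rotary frequency
far_bucket = range(16384, 32769)   # assumed far evaluation bucket W_far

raw = lambda d: float(d)                 # raw lag coordinate
lc = lambda d: L * math.asinh(d / L)     # asinh coordinate (assumed LC phase)

raw_ratio = phase_ratio(raw, far_bucket)
lc_ratio = phase_ratio(lc, far_bucket)
raw_span = far_span(raw, far_bucket, omega)
lc_span = far_span(lc, far_bucket, omega)
print(raw_ratio, lc_ratio, raw_span, lc_span)
```

Under these assumptions the raw coordinate has ratio one by construction, while the asinh coordinate keeps only a small fraction of the far-range span, which is the phase-resolution cost discussed in this subsection.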
Table 5:LC stability and phase\-resolution diagnostics at 32768 tokens\. The wrong\-scale control is numerically stable but loses usable phase span\.



Figure 6: LC stability tradeoff. LC variants bound scale and cache pressure, but the same compactification exposes a phase-resolution cost.
#### Code and reproducibility.
We provide the minimal implementation and experiment scripts needed to rerun the reported studies and regenerate figures and tables locally\. Generated tables, figures, run logs, large checkpoints, and raw corpora are excluded\. The reproducibility repository is available at[https://github\.com/ybzhang\-nxu/Poincare\_Rope](https://github.com/ybzhang-nxu/Poincare_Rope)\.
## 10Discussion and Scope
The experiments give a sector\-selection picture\. Controlled kernels and adaptive probes recover the designed Fourier, finite\-jet, affine, and LC regions\. Synthetic sequence tasks show that these regions can be used inside trainable attention when the positional branch carries the teacher signal\. Byte\-level language runs concentrate on affine/recency behavior under the small\-model, short\-context training regime used here\. Symbolic music\-token streams show stronger LC/affine behavior with small but measurable high\-order corrections\. LC diagnostics then explain the stabilization mechanism by measuring the accompanying loss of far\-range phase resolution\.
#### Relation to common relative\-position regimes\.
The observed allocations are consistent with the broader relative\-position literature\. In the byte\-level language runs, the strongest behavior is affine/recency selection, matching the empirical role of bias\-centric and scaled\-RoPE methods in long\-context language modeling\. The fixed\-kernel and controlled synthetic tasks occupy the Fourier–jet region, where distance\-modulated phase terms are present by construction\. The music\-token runs give an intermediate allocation: LC/affine variants remain strong, while high\-order mass is small but measurable\. The LC diagnostics isolate the corresponding implementation tradeoff: compactification controls high\-order scale and cache pressure with a reduction in far\-range phase resolution\.
Table 6: Relative-position families and PJ-RoPE sectors. The table positions PJ-RoPE as a primitive-family formulation alongside existing relative-position mechanisms.
Since scalar PJ-bias kernels are lag-indexed kernels, FFT-style relative-bias acceleration such as FastRPB is a natural implementation direction for scaling the bias path [[26](https://arxiv.org/html/2606.05345#bib.bib20)].
The dataset interpretation is also specific\. The MusicNet result is a reference\-MIDI token result, not an audio\-label MusicNet result\. The evidence supports a music\-token allocation pattern for structured symbolic streams, and does not address audio transcription or acoustic classification\.
#### Feature correctness versus long\-context stability\.
The exact PJ\-rotary implementation is a controlled feature\-transform contrast\. It verifies that the feature\-transform path can realize RoPE/Jordan\-RoPE\-style relative actions, but Figure[5](https://arxiv.org/html/2606.05345#S9.F5)shows a separate stability requirement at 32768 tokens: loss rises sharply when the same high\-order feature transform is evaluated far beyond the 512\-token training scale\. This contrast separates representation closure from long\-context numerical stability\. The scalar LC path serves as the stabilized chart for long\-context high\-order coordinates\.
#### Scale and generality\.
The trainable evidence uses two-layer, 96-dimensional small Transformers and large extrapolation ratios: language rows train at 1024 bytes and evaluate up to 32768 bytes, while music-token rows train at 512 tokens and evaluate up to 32768 tokens. These settings are deliberately stress tests of positional mechanisms. They do not establish that the same allocation will hold unchanged in billion-parameter models, longer training schedules, or subword-tokenized language models. The main tables report seed standard deviations where available, but several conclusions still rely on small numbers of seeds and single long-context evaluation batches. Larger-scale validation is therefore needed before treating the observed allocations as deployment rules.
## 11Conclusion
PJ\-RoPE reframes relative positional representation as a learnable Fourier–Jet–Affine space: an affine completion of a homogeneous Fourier–jet phase representation\. RoPE, Jordan\-RoPE, high\-order jets, and ALiBi\-like recency are not isolated mechanisms but identifiable sectors of that space\. Adaptive PJ turns this view into an empirical diagnostic by exposing which sector a task selects\. Light\-cone PJ stabilizes the high\-order regime in rapidity/light\-cone coordinates and makes the cost of that stability measurable\.
The result is a controlled position-space argument. Different tasks select different regions of the relative-position space: synthetic teachers recover their intended sectors, small language models prefer recency/affine structure, and music-token streams reveal LC/affine allocation with measurable high-order corrections. Stable long-context use of the space therefore depends on a task-dependent allocation between affine recency, Fourier–jet structure, and LC compactification.
## Appendix ADetailed Comparison with GRAPE
### A\.1Scope of the comparison
GRAPE provides the closest existing group\-action unification framework for RoPE\-like multiplicative rotations and ALiBi\-like additive biases\[[25](https://arxiv.org/html/2606.05345#bib.bib16)\]\. The comparison locates the overlap and the difference: GRAPE emphasizes exact group\-action laws and cacheable relative actions, while PJ\-RoPE emphasizes the non\-semisimple Fourier–jet sector, adaptive sector diagnostics, and LC/rapidity stabilization\.
The baselines in this appendix are exact special cases of the GRAPE framework, not implementations of full learned GRAPE\-M, GRAPE\-A, or GRAPE\-AP\. We compare GRAPE\-M special case / RoPE, GRAPE\-A special case / ALiBi, and GRAPE\-M\+A special case / RoPE\+ALiBi as controlled primitive bases\. The comparison is about primitive function spaces, sector selection, and high\-order stability\.
### A\.2What GRAPE covers
GRAPE formulates positional encoding as group actions with exact relative laws. Multiplicative GRAPE represents positions through norm-preserving rotations $G(n)=\exp(n\omega L)\in SO(d)$, with rank-two skew generators. RoPE is recovered when the rotation planes are canonical coordinate pairs with a chosen frequency spectrum. Additive GRAPE realizes additive logit biases through low-rank unipotent lifts in a larger linear group, recovering ALiBi-like additive slopes as exact special cases. These properties are important: exact relative composition, norm-preserving multiplicative actions, additive unipotent lifts, and streaming cacheability are central strengths of GRAPE.
PJ-RoPE is complementary on a different axis. It asks what changes when the rotary Fourier character itself is made non-semisimple, producing finite jets $d^{r}e^{i\omega d}$, and how those high-order coordinates can be diagnosed and stabilized at long context.
### A\.3Algebraic distinction: where the nilpotent lives
In the additive GRAPE special case, the nilpotent lives in a unipotent lift that generates an additive bias\. A schematic form is
$$G_{\mathrm{add}}(n)=I+n\omega A,\qquad A^{2}=0.$$
This is the correct algebraic home for ALiBi-like additive slopes.
In the Fourier–jet sector of PJ-RoPE, the nilpotent instead lives inside the complex rotary eigenvalue block:
$$J=(-\gamma+i\omega)I+\eta N,\qquad N^{m}=0.$$
Exponentiating gives
$$e^{dJ}=e^{(-\gamma+i\omega)d}\sum_{r=0}^{m-1}\frac{(\eta d)^{r}}{r!}N^{r}.$$
Thus the primitive modes include
$$e^{i\omega d},\quad de^{i\omega d},\quad d^{2}e^{i\omega d},\ldots.$$
The key difference is that GRAPE-M+A special cases combine phase and distance as separate primitives, whereas PJ-FJ makes distance-modulated phase a primitive.
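As a worked instance of the Jordan-block exponential, consider the smallest non-trivial case $m=2$. The explicit matrix $N$ below is a choice made here for illustration; any nilpotent with $N^{2}=0$ gives the same structure. Because $N^{2}=0$, the series stops after the linear term:

```latex
N=\begin{pmatrix}0&1\\0&0\end{pmatrix},\qquad
e^{dJ}=e^{(-\gamma+i\omega)d}\left(I+\eta d\,N\right)
      =e^{(-\gamma+i\omega)d}\begin{pmatrix}1&\eta d\\0&1\end{pmatrix}.
```

The diagonal entries carry the ordinary damped Fourier character, and the off-diagonal entry $\eta d\,e^{(-\gamma+i\omega)d}$ is the first-order jet, which reduces to a multiple of $de^{i\omega d}$ when $\gamma=0$.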
###### Lemma A\.1\(Direct\-sum phase/affine features do not contain Fourier jets\)\.
For $\omega\not\equiv 0,\pi$, the function $de^{i\omega d}$ is not contained in the finite span
$$\operatorname{span}\{1,d,e^{i\omega d},e^{-i\omega d}\}$$
over infinitely many integer lags.
###### Proof sketch\.
This follows from the linear independence of exponential-polynomial sequences with distinct characteristic roots. The term $de^{i\omega d}$ corresponds to a repeated root at $e^{i\omega}$, while the direct-sum phase/affine basis only contains simple roots at $e^{\pm i\omega}$ and the affine root at $1$. This is a special case of the repeated-root description in Appendix [B](https://arxiv.org/html/2606.05345#A2). ∎
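The lemma can be checked numerically with the annihilating difference operator of the direct-sum span. The sketch below is illustrative only: the frequency `omega = 0.7`, the coefficients of the in-span test sequence, and the helper names `shift_minus` and `apply_roots` are assumptions introduced here. Any sequence in the span is sent to zero by $(E-1)^{2}(E-e^{i\omega})(E-e^{-i\omega})$, whereas the jet $de^{i\omega d}$ is not.

```python
import cmath

def shift_minus(z, seq):
    """Apply (E - z) to a finite sequence: out[d] = seq[d + 1] - z * seq[d]."""
    return [seq[d + 1] - z * seq[d] for d in range(len(seq) - 1)]

def apply_roots(roots, seq):
    """Apply the product of (E - z) over all listed roots (with repetition)."""
    for z in roots:
        seq = shift_minus(z, seq)
    return seq

omega = 0.7                      # illustrative frequency, not 0 or pi mod 2*pi
w = cmath.exp(1j * omega)
lags = range(40)

# Annihilator of span{1, d, e^{i w d}, e^{-i w d}}:
# a double root at 1 and simple roots at e^{+i w} and e^{-i w}.
direct_sum_roots = [1.0, 1.0, w, w.conjugate()]

# An arbitrary element of the span, and the first Fourier jet.
in_span = [2.0 - 0.5 * d + 3.0 * w**d + (1 - 2j) * w.conjugate()**d for d in lags]
jet = [d * w**d for d in lags]

res_span = max(abs(x) for x in apply_roots(direct_sum_roots, in_span))
res_jet = max(abs(x) for x in apply_roots(direct_sum_roots, jet))
print(res_span, res_jet)
```

The first residual is at floating-point noise level, while the second stays of order one, because a single factor $(E-e^{i\omega})$ only lowers the jet order instead of annihilating it.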
### A\.4Conceptual comparison table
Table 7: Conceptual comparison between GRAPE and PJ-RoPE. GRAPE emphasizes exact group-action unification and cacheable relative laws; PJ-RoPE emphasizes non-semisimple Fourier jets, adaptive sector diagnostics, and LC stabilization.
Table 8: Exactness and primitive-mode sanity table. GRAPE special-case rows cover exact group-action laws; PJ rows identify the Fourier-jet and LC axes used in this paper.
### A\.5Controlled primitive\-basis evidence
The fixed projection experiment makes the algebraic distinction measurable. We fit each target $y(d)$ by least squares over a fixed basis $B(d)c$. All rows use the same frequency grid. PJ-FJ variants add jet orders at that frequency; the experiment is a primitive-containment probe.
The important targets are $\cos\omega d$, $-d/L$, $(d/L)\cos\omega d$, and $(d/L)^{2}\cos\omega d$. The direct-sum GRAPE-M+A special case covers phase and affine targets. Phase-modulated distance appears as an explicit primitive in PJ-FJ order one and two.
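The containment probe can be sketched in pure Python. This is a minimal illustration, not the paper's experiment script: the window length `L = 64`, the frequency `omega = 0.3`, and the helpers `solve` and `r_squared` are assumptions introduced here, and the fit is in-sample, so it does not reproduce the extrapolation setting of the reported table.

```python
import math

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def r_squared(basis, target, lags):
    """Least-squares fit of target(d) on the basis via normal equations; returns R^2."""
    B = [[f(d) for f in basis] for d in lags]
    y = [target(d) for d in lags]
    k = len(basis)
    gram = [[sum(row[i] * row[j] for row in B) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * yi for row, yi in zip(B, y)) for i in range(k)]
    c = solve(gram, rhs)
    fit = [sum(ci * bi for ci, bi in zip(c, row)) for row in B]
    mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fit))
    ss_tot = sum((yi - mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

L, omega = 64, 0.3            # hypothetical window length and frequency
lags = range(L)
# Direct-sum phase/affine basis (stand-in for the GRAPE-M+A special case).
direct_sum = [lambda d: 1.0, lambda d: d / L,
              lambda d: math.cos(omega * d), lambda d: math.sin(omega * d)]
# Same basis plus first-order jets at the same frequency (stand-in for PJ-FJ order one).
first_jet = direct_sum + [lambda d: (d / L) * math.cos(omega * d),
                          lambda d: (d / L) * math.sin(omega * d)]
target = lambda d: (d / L) * math.cos(omega * d)

r2_direct = r_squared(direct_sum, target, lags)
r2_jet = r_squared(first_jet, target, lags)
print(round(r2_direct, 3), round(r2_jet, 6))
```

In this toy setting the jet-augmented basis reproduces the phase-modulated distance target essentially exactly, while the direct-sum basis leaves a visible residual.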
Table 9: GRAPE special-case and PJ primitive-basis projections. All rows use the same frequency grid; PJ-FJ variants add jet orders at that frequency. The table reports primitive containment through fixed projection. Values are $R^{2}$ at the longest evaluation length; em dashes mark failed extrapolations with $R^{2}<-10$.
### A\.6Cross\-task stress reruns with GRAPE special\-case controls
Table[10](https://arxiv.org/html/2606.05345#A1.T10)reports GRAPE special\-case reruns under the same small trainable setting\. Natural\-task rows use the 8192\-token maximum length from this appendix sweep; the main natural\-task table uses the 32768\-token budget\. Table[9](https://arxiv.org/html/2606.05345#A1.T9)tests primitive containment by projection; these rows test finite\-budget optimization and extrapolation\. In the signed first\-jet row, GRAPE\-M/RoPE outperforms PJ variants at 384 tokens, although it does not contain the Fourier\-jet primitive\. This indicates that a phase\-only surrogate can correlate with the thresholded teacher labels over the sampled window, while the learned jet branch is less stable under this extrapolation setting\. Thus Table[9](https://arxiv.org/html/2606.05345#A1.T9)and Table[10](https://arxiv.org/html/2606.05345#A1.T10)answer different questions: primitive containment versus finite\-budget trainability\.
Table 10:Cross\-task stress reruns with GRAPE special\-case controls\. Rows use restricted GRAPE\-M/A exact special\-case controls\. Synthetic rows report accuracy over three seeds; language and music rows report validation cross\-entropy over two seeds at the maximum length available in this appendix sweep, 8192 tokens for natural\-task rows\. The table reports finite\-budget trainability and extrapolation; Table[9](https://arxiv.org/html/2606.05345#A1.T9)reports primitive containment\. The PJ reference column is PJ\-FJ for synthetic rows and LC\-PJ\+A for natural rows\.
### A\.7Summary of the comparison
Overall, GRAPE and PJ\-RoPE overlap on the unification axis but emphasize different coordinates\. GRAPE emphasizes exact group\-action laws, norm\-preserving multiplicative rotations, additive unipotent lifts, and cacheable relative actions\. PJ\-RoPE emphasizes non\-semisimple Fourier jets, adaptive sector diagnostics, and LC stabilization of high\-order behavior\.
## Appendix BConstant\-coefficient Difference Modules and Fourier Jets
This appendix makes explicit the difference\-module viewpoint used in the main text\. It is not a new experimental claim\. Its purpose is to put RoPE, Jordan\-RoPE, ALiBi, and the Fourier–jet part of PJ\-RoPE into the standard language of shift operators and repeated characteristic roots\.
### B\.1Shift operator and characteristic roots
Let $E$ denote the discrete shift operator on functions of the integer lag:
$$(Ef)(d)=f(d+1).$$
If $f(d)=z^{d}$, then
$$(Ef)(d)=z^{d+1}=z\,f(d).$$
Thus $z^{d}$ is a characteristic mode of the shift operator with root $z$. RoPE corresponds to simple unit-circle roots $z=e^{i\omega}$. A damped Fourier mode,
$$e^{-cd/L}e^{i\omega d}=z^{d},\qquad z=e^{-c/L+i\omega},$$
corresponds to a simple interior root with $|z|<1$.
### B\.2Repeated roots and finite jets
The natural polynomial basis for repeated roots of the shift operator is the binomial basis
$$\phi_{z,r}(d)=\binom{d}{r}z^{d},\qquad r\geq 0.$$
We set $\phi_{z,-1}=0$.
###### Proposition 2\(Repeated roots generate finite jets\)\.
Let $E$ be the shift operator and define
$$\phi_{z,r}(d)=\binom{d}{r}z^{d}.$$
Then
$$(E-z)\phi_{z,r}=z\phi_{z,r-1},$$
and therefore
$$\phi_{z,r}\in\ker(E-z)^{r+1}.$$
Thus a root $z$ of multiplicity $m$ generates the modes
$$z^{d},\binom{d}{1}z^{d},\ldots,\binom{d}{m-1}z^{d}.$$
###### Proof\.
Using $\binom{d+1}{r}-\binom{d}{r}=\binom{d}{r-1}$, we have
$$(E-z)\phi_{z,r}(d)=\binom{d+1}{r}z^{d+1}-z\binom{d}{r}z^{d}=z\binom{d}{r-1}z^{d}=z\phi_{z,r-1}(d).$$
Iterating the identity gives $(E-z)^{r+1}\phi_{z,r}=0$. ∎
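The lowering identity and the kernel statement can be checked numerically. The root `z`, the order `r`, the lag range, and the helper names `phi` and `shift_minus` below are illustrative assumptions; the sketch only evaluates both sides on a finite set of integer lags.

```python
import cmath
from math import comb

def phi(z, r, d):
    """Binomial jet mode phi_{z,r}(d) = C(d, r) * z**d, with phi_{z,-1} = 0."""
    return 0.0 if r < 0 else comb(d, r) * z**d

def shift_minus(z, f):
    """Return the function (E - z) f, where (E f)(d) = f(d + 1)."""
    return lambda d: f(d + 1) - z * f(d)

z = 0.98 * cmath.exp(0.4j)      # illustrative root; any nonzero z works
r = 3
lags = range(30)

# Lowering identity: (E - z) phi_{z,r} = z * phi_{z,r-1}.
lhs = shift_minus(z, lambda d: phi(z, r, d))
identity_err = max(abs(lhs(d) - z * phi(z, r - 1, d)) for d in lags)

# Applying (E - z) a total of r + 1 times annihilates phi_{z,r}.
g = lambda d: phi(z, r, d)
for _ in range(r + 1):
    g = shift_minus(z, g)
kernel_err = max(abs(g(d)) for d in lags)
print(identity_err, kernel_err)
```

Both errors sit at floating-point noise level, consistent with $\phi_{z,r}\in\ker(E-z)^{r+1}$.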
The main text often writes jet factors as $(d/L)^{r}z^{d}$. This is only a different normalization. For any fixed maximum order, the monomial basis $\{d^{r}\}$ and the binomial basis $\{\binom{d}{r}\}$ are related by an invertible triangular change of basis, so they span the same finite polynomial-jet space.
### B\.3RoPE, Jordan\-RoPE, and ALiBi as roots
The root interpretation gives a compact dictionary for the positional primitives in this paper\.
Table 11: Difference-module interpretation of common relative-position primitives.
In this notation, RoPE satisfies $(E-e^{i\omega})K=0$. First-order Jordan-RoPE satisfies $(E-e^{i\omega})^{2}K=0$. ALiBi’s affine direction is the unit-root jet, since $(E-1)^{2}d=0$. A first-order complex PJ kernel can be viewed as a solution of
$$(E-z)^{2}(E-1)^{2}K=0,\qquad z=e^{i\omega}.$$
For real-valued kernels, the conjugate root must also be included:
$$(E-z)^{2}(E-\overline{z})^{2}(E-1)^{2}K=0.$$
The corresponding real basis contains
$$\cos\omega d,\quad\sin\omega d,\quad d\cos\omega d,\quad d\sin\omega d,\quad 1,\quad d.$$
### B\.4General PJ position space as a difference module
Let
$$P(t)=\prod_{a=1}^{M}(t-z_{a})^{m_{a}}.$$
The finite-dimensional solution space of the constant-coefficient difference equation
$$P(E)K=0$$
is the exponential-polynomial space
$$K(d)=\sum_{a=1}^{M}\sum_{r=0}^{m_{a}-1}c_{a,r}\binom{d}{r}z_{a}^{d}.$$
Thus the Fourier–jet part of PJ-RoPE can be read as a finite exponential-polynomial solution space generated by simple and repeated roots of the shift operator. The affine recency branch is the repeated root at $z=1$.
### B\.5Equivalence with finite\-dimensional representations
The same structure appears as matrix coefficients of finite\-dimensional representations\. Suppose
$$K(d)=u^{\top}T^{d}v$$
and
$$T\simeq\bigoplus_{a}z_{a}(I+N_{a}),\qquad N_{a}^{m_{a}}=0.$$
Then
$$T^{d}\simeq\bigoplus_{a}z_{a}^{d}(I+N_{a})^{d}=\bigoplus_{a}z_{a}^{d}\sum_{r=0}^{m_{a}-1}\binom{d}{r}N_{a}^{r}.$$
Every matrix coefficient is therefore a finite linear combination of $\binom{d}{r}z_{a}^{d}$.
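For a single block, the matrix-power formula can be verified directly. In this sketch the block size `m = 3`, the root `z`, the lag `d`, the vectors `u` and `v`, and the choice of $N$ as the superdiagonal shift matrix are all illustrative assumptions.

```python
from math import comb

def matmul(A, B):
    """Plain square matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def matpow(A, d):
    """A**d by repeated multiplication, for a non-negative integer d."""
    n = len(A)
    out = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(d):
        out = matmul(out, A)
    return out

m = 3
z = 0.9 + 0.3j                       # illustrative root
# N is the nilpotent shift matrix (ones on the superdiagonal), so N**m = 0.
N = [[1.0 if j == i + 1 else 0.0 for j in range(m)] for i in range(m)]
T = [[z * ((1.0 if i == j else 0.0) + N[i][j]) for j in range(m)] for i in range(m)]

d = 12
direct = matpow(T, d)
# Closed form z**d * sum_r C(d, r) N**r: entry (i, j) of N**r is 1 exactly when j - i = r.
closed = [[z**d * comb(d, j - i) if j >= i else 0.0 for j in range(m)] for i in range(m)]
max_err = max(abs(direct[i][j] - closed[i][j]) for i in range(m) for j in range(m))

# A matrix coefficient K(d) = u^T T^d v is a combination of C(d, r) * z**d.
u, v = [1.0, 0.0, 0.0], [0.5, -1.0, 2.0]
K = sum(u[i] * direct[i][j] * v[j] for i in range(m) for j in range(m))
K_closed = z**d * (0.5 * comb(d, 0) - 1.0 * comb(d, 1) + 2.0 * comb(d, 2))
coeff_err = abs(K - K_closed)
print(max_err, coeff_err)
```

With `u` selecting the first row, the coefficients of $\binom{d}{r}z^{d}$ are simply the entries of `v`, which makes the finite-jet content of the matrix coefficient explicit.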
Table 12:Three equivalent languages for the same finite\-jet structure\.
### B\.6Stability from root location
Write $z=\rho e^{i\omega}$. The root location explains the basic stability regimes:
$$\begin{array}{ll}|z|<1:&\text{damped exponential decay},\\ |z|=1,\text{ simple root}:&\text{bounded oscillation},\\ |z|=1,\text{ repeated root}:&\text{polynomial growth},\\ |z|>1:&\text{exponential growth}.\end{array}$$
Repeated roots on the unit circle are exactly where polynomial growth enters. This explains why high-order Fourier jets are expressive but require stabilization at long context.
LC-PJ is not itself an exact constant-coefficient difference-module solution, because
$$\phi_{L}(d)=L\operatorname{asinh}(d/L)$$
is a nonlinear coordinate substitution. It should instead be understood as a compactified deformation of the repeated-root jet coordinates, designed to preserve local jet behavior while bounding far-field growth.
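The effect of the asinh substitution on jet amplitudes can be illustrated numerically. This is one possible reading, assuming that the raw jet factor $(d/L)^{r}$ is replaced by $(\phi_{L}(d)/L)^{r}$; the scale `L = 512` and the sampled lags are hypothetical, and the actual LC-PJ parameterization may differ.

```python
import math

def lc_coord(d, L):
    """Asinh coordinate phi_L(d) = L * asinh(d / L)."""
    return L * math.asinh(d / L)

L = 512.0                    # hypothetical scale, not taken from the paper
for d in (8, 512, 32768):
    for r in (1, 2, 3):
        raw = (d / L) ** r                  # raw polynomial jet amplitude
        lc = (lc_coord(d, L) / L) ** r      # amplitude after the asinh substitution
        print(f"d={d:6d} r={r} raw={raw:12.3f} lc={lc:8.3f}")

# Near field: the substitution is close to the identity, so local jets are preserved.
near_gap = abs(lc_coord(8, L) - 8) / 8
# Far field: the third-order amplitude grows polynomially in raw coordinates only.
far_raw = (32768 / L) ** 3
far_lc = (lc_coord(32768, L) / L) ** 3
```

Under these assumptions the coordinate is nearly the identity for lags much smaller than `L`, while at 32768 tokens the third-order amplitude drops by more than three orders of magnitude.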
### B\.7Fourier\-distribution interpretation
There is an equivalent Fourier-distribution way to say the same thing. The ordinary Fourier character $e^{i\omega d}$ corresponds to a point mass $\delta_{\omega}$ in frequency. Since
$$de^{i\omega d}=\frac{1}{i}\partial_{\omega}e^{i\omega d},$$
the first Fourier jet corresponds to a derivative-of-delta direction at the same frequency. ALiBi’s linear term is the analogous derivative direction at zero frequency. This is the distributional version of the repeated-root statement: Jordan-RoPE replaces a spectral point mass by its finite jet, and ALiBi is the zero-frequency affine jet.
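The derivative identity behind this reading can be checked with a central finite difference in frequency. The values of `omega`, `d`, and the step `h` are arbitrary illustrative choices, and `character` is a helper name introduced here.

```python
import cmath

def character(omega, d):
    """Fourier character e^{i * omega * d}."""
    return cmath.exp(1j * omega * d)

omega, d, h = 0.37, 11, 1e-5      # illustrative values
# Central difference approximates the frequency derivative of the character.
d_omega = (character(omega + h, d) - character(omega - h, d)) / (2 * h)
jet_from_derivative = d_omega / 1j
jet_direct = d * character(omega, d)
err = abs(jet_from_derivative - jet_direct)
print(err)
```

The two expressions agree up to the discretization error of the finite difference, matching the statement that the first Fourier jet is the frequency derivative of the character.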
## References
- [1] G. Angelotti (2023). HyPE: attention with hyperbolic biases for relative positional encoding. arXiv preprint arXiv:2310.19676. [Document](https://dx.doi.org/10.48550/arXiv.2310.19676), [Link](https://arxiv.org/abs/2310.19676).
- [2] S. Chen, S. Wong, L. Chen, and Y. Tian (2023). Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595. [Link](https://arxiv.org/abs/2306.15595).
- [3] T. Chi, T. Fan, P. J. Ramadge, and A. I. Rudnicky (2022). KERPLE: kernelized relative positional embedding for length extrapolation. In Advances in Neural Information Processing Systems, Vol. 35, pp. 8386–8399. [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/37a413841a614b5414b333585e7613b8-Abstract-Conference.html).
- [4] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov (2019-07). Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2978–2988. [Document](https://dx.doi.org/10.18653/v1/P19-1285), [Link](https://aclanthology.org/P19-1285/).
- [5] Y. Ding, L. L. Zhang, C. Zhang, Y. Xu, N. Shang, J. Xu, F. Yang, and M. Yang (2024). LongRoPE: extending LLM context window beyond 2 million tokens. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 235, pp. 11091–11104. [Link](https://proceedings.mlr.press/v235/ding24i.html).
- [6] W. Gao (2024). MEP: multiple kernel learning enhancing relative positional encoding length extrapolation. arXiv preprint arXiv:2403.17698. [Document](https://dx.doi.org/10.48550/arXiv.2403.17698), [Link](https://arxiv.org/abs/2403.17698).
- [7] B. C. Hall (2015). Lie groups, lie algebras, and representations: an elementary introduction. 2nd edition, Graduate Texts in Mathematics, Vol. 222, Springer, Cham. [Document](https://dx.doi.org/10.1007/978-3-319-13467-3), [Link](https://link.springer.com/book/10.1007/978-3-319-13467-3).
- [8] A. Haviv, O. Ram, O. Press, P. Izsak, and O. Levy (2022-12). Transformer language models without positional encodings still learn positional information. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, pp. 1382–1390. [Document](https://dx.doi.org/10.18653/v1/2022.findings-emnlp.99), [Link](https://aclanthology.org/2022.findings-emnlp.99/).
- [9] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck (2019). Enabling factorized piano music modeling and generation with the MAESTRO dataset. In International Conference on Learning Representations. arXiv:1810.12247. [Link](https://openreview.net/forum?id=r1lYRjC9F7).
- [10] C. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck (2019). Music transformer: generating music with long-term structure. In International Conference on Learning Representations. arXiv:1809.04281. [Link](https://openreview.net/forum?id=rJe4ShAcF7).
- [11] A. Kazemnejad, I. Padhi, K. N. Ramamurthy, P. Das, and S. Reddy (2023). The impact of positional encoding on length generalization in transformers. In Advances in Neural Information Processing Systems, Vol. 36. arXiv:2305.19466. [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/4e85362c02172c0c6567ce593122d31c-Abstract-Conference.html).
- [12] K. Kogkalidis, J. Bernardy, and V. Garg (2024). Algebraic positional encodings. In Advances in Neural Information Processing Systems, Vol. 37. [Document](https://dx.doi.org/10.52202/079017-1099), [Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/3d8f2fdc04fa66c9239f2eb14379546d-Abstract-Conference.html).
- [13] S. Li, C. You, G. Guruganesh, J. Ainslie, S. Ontanon, M. Zaheer, S. Sanghai, Y. Yang, S. Kumar, and S. Bhojanapalli (2024). Functional interpolation for relative positions improves long context transformers. In International Conference on Learning Representations. arXiv:2310.04418. [Link](https://proceedings.iclr.cc/paper_files/paper/2024/hash/2f55a8b7b1c2c6312eb86557bb9a2bd5-Abstract-Conference.html).
- [14] S. Ostmeier, B. Axelrod, M. Varma, M. E. Moseley, A. Chaudhari, and C. Langlotz (2025). LieRE: lie rotational positional encodings. In Proceedings of the 42nd International Conference on Machine Learning. ICML 2025; arXiv:2406.10322. [Link](https://arxiv.org/abs/2406.10322).
- [15] B. Peng, J. Quesnelle, H. Fan, and E. Shippole (2024). YaRN: efficient context window extension of large language models. In International Conference on Learning Representations. arXiv:2309.00071. [Link](https://openreview.net/forum?id=wHBfxhZu1u).
- [16] O. Press, N. A. Smith, and M. Lewis (2022). Train short, test long: attention with linear biases enables input length extrapolation. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=R8sQPpGCv0).
- [17] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), pp. 1–67. [Link](https://jmlr.org/papers/v21/20-074.html).
- [18] P. Shaw, J. Uszkoreit, and A. Vaswani (2018-06). Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 464–468. [Document](https://dx.doi.org/10.18653/v1/N18-2074), [Link](https://aclanthology.org/N18-2074/).
- [19] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu (2021). RoFormer: enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864. [Link](https://arxiv.org/abs/2104.09864).
- \[20\]Y\. Sun, L\. Dong, B\. Patra, S\. Ma, S\. Huang, A\. Benhaim, V\. Chaudhary, X\. Song, and F\. Wei\(2023\-07\)A length\-extrapolatable transformer\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Toronto, Canada,pp\. 14590–14604\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.816),[Link](https://aclanthology.org/2023.acl-long.816/)Cited by:[Table 6](https://arxiv.org/html/2606.05345#S10.T6.2.6.3.2.1.1),[§3](https://arxiv.org/html/2606.05345#S3.SS0.SSS0.Px3.p1.1)\.
- \[21\]J\. Thickstun, Z\. Harchaoui, and S\. Kakade\(2017\)Learning features of music from scratch\.InInternational Conference on Learning Representations,Note:arXiv:1611\.09827External Links:[Link](https://openreview.net/forum?id=rkFBJv9gg)Cited by:[§3](https://arxiv.org/html/2606.05345#S3.SS0.SSS0.Px8.p1.1)\.
- \[22\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,Vol\.30,pp\. 5998–6008\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)Cited by:[§1](https://arxiv.org/html/2606.05345#S1.p1.1),[§3](https://arxiv.org/html/2606.05345#S3.SS0.SSS0.Px1.p1.1)\.
- \[23\]E\. P\. Wigner\(1939\)On unitary representations of the inhomogeneous Lorentz group\.Annals of Mathematics40\(1\),pp\. 149–204\.External Links:[Document](https://dx.doi.org/10.2307/1968551),[Link](https://www.jstor.org/stable/1968551)Cited by:[§1](https://arxiv.org/html/2606.05345#S1.p3.1)\.
- \[24\]Y\. Zhang\(2026\)Jordan\-RoPE: non\-semisimple relative positional encoding via complex Jordan blocks\.Note:arXiv preprint arXiv:2605\.04217External Links:2605\.04217,[Document](https://dx.doi.org/10.48550/arXiv.2605.04217),[Link](https://arxiv.org/abs/2605.04217)Cited by:[§1](https://arxiv.org/html/2606.05345#S1.p1.1),[Table 6](https://arxiv.org/html/2606.05345#S10.T6.2.2.3.1.1),[§2](https://arxiv.org/html/2606.05345#S2.p3.2),[§3](https://arxiv.org/html/2606.05345#S3.SS0.SSS0.Px7.p1.1)\.
- \[25\]Y\. Zhang, Z\. Chen, Y\. Liu, Z\. Qin, H\. Yuan, K\. Xu, Y\. Yuan, Q\. Gu, and A\. C\. Yao\(2026\)Group representational position encoding\.InInternational Conference on Learning Representations,Note:ICLR 2026; arXiv:2512\.07805External Links:2512\.07805,[Document](https://dx.doi.org/10.48550/arXiv.2512.07805),[Link](https://arxiv.org/abs/2512.07805)Cited by:[§A\.1](https://arxiv.org/html/2606.05345#A1.SS1.p1.1),[§3](https://arxiv.org/html/2606.05345#S3.SS0.SSS0.Px6.p1.1)\.
- \[26\]M\. Zubkov and D\. Gavrilov\(2022\)FastRPB: a scalable relative positional encoding for long sequence tasks\.Note:arXiv preprint arXiv:2202\.11364External Links:2202\.11364,[Document](https://dx.doi.org/10.48550/arXiv.2202.11364),[Link](https://arxiv.org/abs/2202.11364)Cited by:[§10](https://arxiv.org/html/2606.05345#S10.SS0.SSS0.Px1.p2.1),[Table 6](https://arxiv.org/html/2606.05345#S10.T6.2.8.5.2.1.1)\.