RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably
Summary
This paper proves that RoPE-based attention fails to distinguish both token positions and token identities in long contexts, which helps explain why LLMs fail on inputs well within their advertised context lengths. Experiments confirm that models optimized for needle-in-a-haystack-style retrieval struggle on a simple task: returning the k-th item in a list.
Cached at: 05/20/26, 06:39 PM
Source: https://huggingface.co/papers/2605.15514
LLMs often fail on inputs well within their advertised context lengths. We show that these failures are not merely engineering issues but stem from intrinsic limitations of RoPE in long contexts.
Main finding: In long contexts, RoPE-based attention frequently assigns the same attention weight to a token even when it is moved to different positions. Similarly, it can assign the same attention weight to different tokens at the same position.
In this sense, RoPE attention fails to distinguish both where a token appears and what token appears there, hence the title.
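To make the quantity under discussion concrete, here is a minimal sketch of a single-head RoPE attention score (the pre-softmax logit) in plain Python. It uses the standard RoPE formulation with pairwise rotations and a default base of 10000; it is not the paper's proof or experimental setup. The names `rope_rotate` and `rope_score`, the toy vectors, and the chosen distances are assumptions made for illustration.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply the standard RoPE rotation to an even-length vector at position `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair (i, i+1) rotates with frequency base^(-i/d), i.e. base^(-2j/d) for pair j = i/2.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_score(q, k, q_pos, k_pos, base=10000.0):
    """Pre-softmax attention logit between a rotated query and a rotated key."""
    qr = rope_rotate(q, q_pos, base)
    kr = rope_rotate(k, k_pos, base)
    return sum(a * b for a, b in zip(qr, kr))


if __name__ == "__main__":
    # Toy query/key vectors (illustrative values, not taken from any model).
    q = [1.0, 0.5, -0.3, 0.8]
    k = [0.2, -0.7, 0.9, 0.1]
    q_pos = 100_000
    # Move the same key token to different distances from the query and print the logit.
    for distance in (1, 10, 100, 1_000, 10_000, 100_000):
        print(distance, rope_score(q, k, q_pos, q_pos - distance))
```

Because each pair is rotated by an angle proportional to its position, the score depends only on the offset between query and key positions, so moving a token changes the logit only through that offset. The loop lets you inspect how the logit varies as the same key is placed at different distances.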
We prove these results theoretically and verify them empirically. While the theoretical analysis focuses on a single attention head, we complement it with experiments on real multi-layer, multi-head LLMs. The experiments confirm failures predicted by our theory: LLMs optimized for needle-in-a-haystack-style retrieval will inevitably struggle on a very simple task that asks for the k-th item in a list.
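The k-th item task mentioned above can be sketched as a small prompt generator. This is only an illustration of the task's shape: the prompt wording, the item format, and the function name `make_kth_item_example` are assumptions, not the paper's actual benchmark.

```python
import random


def make_kth_item_example(num_items, k, seed=0):
    """Build a (prompt, expected_answer) pair asking for the k-th item of a list.

    Hypothetical format: one item per line, 1-based indexing.
    """
    if not 1 <= k <= num_items:
        raise ValueError("k must be between 1 and num_items")
    rng = random.Random(seed)
    # Unique random identifiers so the answer cannot be guessed from content alone.
    ids = rng.sample(range(10**6), num_items)
    items = [f"item-{i:06d}" for i in ids]
    prompt = (
        "Here is a list:\n"
        + "\n".join(items)
        + f"\n\nWhat is item number {k} in the list (counting from 1)?"
    )
    return prompt, items[k - 1]


if __name__ == "__main__":
    prompt, answer = make_kth_item_example(num_items=5, k=3)
    print(prompt)
    print("expected:", answer)
```

Unlike needle-in-a-haystack retrieval, where the target can be found by matching its content, answering here requires knowing which position each item occupies. Increasing `num_items` lengthens the context while the question stays equally simple.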
My personal takeaway: advertised context lengths should be interpreted with care. Future long-context LMs may require rethinking how position and token order are represented. With current architectures, agentic frameworks that break long contexts into shorter ones may be a more effective way to work around the intrinsic limitations of RoPE.
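As one way to picture "breaking long contexts into shorter ones", here is a minimal chunking sketch. It only shows the splitting step; how an agentic framework would query each chunk and combine the answers is left out. The function name `split_into_chunks` and the overlap parameter are assumptions for illustration, not something the paper prescribes.

```python
def split_into_chunks(tokens, chunk_size, overlap=0):
    """Split a token sequence into windows of at most `chunk_size` tokens.

    Returns a list of (start_index, chunk) pairs. Consecutive chunks share
    `overlap` tokens so that content near a boundary appears in both.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append((start, tokens[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    tokens = list(range(10))  # stand-in for real token ids
    for start, chunk in split_into_chunks(tokens, chunk_size=4, overlap=1):
        print(start, chunk)
```

Keeping the start index with each chunk preserves the global position of every token, so a caller can still answer position-dependent questions after processing the chunks separately.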
Similar Articles
RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably
This paper provides a theoretical proof that Rotary Positional Embeddings (RoPE) in Transformer-based language models lose their locality bias and ability to distinguish token order in long contexts, with attention scores becoming no better than random. The authors show that increasing the RoPE base trades off position vs. token distinction and that multi-head, multi-layer architectures cannot compensate for this fundamental limitation.
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.
@samhogan: RLMs pretty much solved context btw You can shove tens of millions of tokens into a good RLM harness and it just works.…
A developer shares their experience with Recurrent Language Models (RLMs), claiming they effectively handle extremely long context windows with tens of millions of tokens, representing a significant advancement in context handling capabilities.
@ickma2311: Efficient AI Lecture 15: Long-Context LLM Long context is not just a bigger prompt window. The key question is: which p…
This post summarizes Efficient AI Lecture 15 on long-context LLMs, covering RoPE position interpolation for context extension, the needle-in-haystack evaluation, and StreamingLLM's attention sink phenomenon and KV cache eviction strategy.
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Researchers apply contrastive LRP-based attribution to analyze why LLMs fail on realistic benchmarks, finding the method gives useful signals in some cases but is not universally reliable.