reasoning-benchmark

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

arXiv cs.CL · 2026-05-25

This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.
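To make the idea of varying task position, filler content, and context length concrete, here is a minimal Python sketch of how such an evaluation grid could be assembled. It is not the authors' CRE implementation: the function names `build_context` and `evaluation_grid`, the sentence-level granularity, and the cyclic repetition of filler are all assumptions made for illustration.

```python
from itertools import product


def build_context(task, filler_sentences, total_sentences, position):
    """Embed `task` in filler at a relative position (0.0 = start, 1.0 = end).

    Hypothetical helper: the sentence-level layout is an assumption.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    if not filler_sentences:
        raise ValueError("need at least one filler sentence")
    # Repeat the filler cyclically until the requested length is reached.
    filler = [filler_sentences[i % len(filler_sentences)]
              for i in range(total_sentences)]
    insert_at = round(position * total_sentences)
    return "\n".join(filler[:insert_at] + [task] + filler[insert_at:])


def evaluation_grid(task, fillers, lengths, positions):
    """Yield one prompt per (filler type, context length, task position) cell."""
    for (name, sentences), length, pos in product(
            fillers.items(), lengths, positions):
        yield {
            "filler": name,
            "length": length,
            "position": pos,
            "prompt": build_context(task, sentences, length, pos),
        }


if __name__ == "__main__":
    # Placeholder task and filler text, invented for this example.
    fillers = {
        "neutral": ["The sky was grey.", "A train passed by."],
        "numeric": ["Item 4 costs 7 units.", "Item 9 costs 2 units."],
    }
    cells = list(evaluation_grid(
        task="Question: what is 17 + 25?",
        fillers=fillers,
        lengths=[4, 8],
        positions=[0.0, 0.5, 1.0],
    ))
    print(len(cells), "prompts generated")
```

Scoring each cell separately, rather than averaging over the whole grid, is what would let a drop at the middle positions show up instead of being masked by good results at the start and end.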