Tag
This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.