@lateinteraction: At this point in time, two of the extremely few long-context benchmarks I'd assign any weight at all to are OBLIQ-Bench…


Summary

A commentator names OBLIQ-Bench (recall@k) and StudyBench (expertise) as two of the extremely few long-context benchmarks they would assign any weight to.

At this point in time, two of the extremely few long-context benchmarks I'd assign any weight at all to are OBLIQ-Bench (recall@k) and StudyBench (expertise).

Similar Articles

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

arXiv cs.CL

This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.
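
To make the idea of controlling task position concrete, here is a minimal sketch of an evaluation grid that varies position, filler content, and context length. It is not the paper's CRE implementation: the function names `build_context` and `make_grid`, the use of sentence counts as a stand-in for context length, and the relative-position parameter are all assumptions made for illustration.

```python
from itertools import product
from typing import Iterator


def build_context(task: str, filler: list[str], n_filler: int, position: float) -> str:
    """Embed `task` among `n_filler` filler sentences.

    `position` is relative: 0.0 puts the task first, 1.0 puts it last.
    (Hypothetical helper; the paper may define position differently.)
    """
    if not filler:
        raise ValueError("filler must contain at least one sentence")
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    # Cycle through the filler sentences until the requested length is reached.
    padding = [filler[i % len(filler)] for i in range(n_filler)]
    idx = round(position * n_filler)
    return "\n".join(padding[:idx] + [task] + padding[idx:])


def make_grid(
    task: str,
    filler_sets: dict[str, list[str]],
    lengths: list[int],
    positions: list[float],
) -> Iterator[dict]:
    """Yield one evaluation case per (filler set, length, position) combination."""
    for (name, filler), n, pos in product(filler_sets.items(), lengths, positions):
        yield {
            "filler": name,
            "n_filler": n,
            "position": pos,
            "context": build_context(task, filler, n, pos),
        }


if __name__ == "__main__":
    cases = list(
        make_grid(
            task="Q: What is 17 + 25?",
            filler_sets={"neutral": ["The sky was grey."], "numeric": ["Item 3 costs 9."]},
            lengths=[10, 100],
            positions=[0.0, 0.5, 1.0],
        )
    )
    print(len(cases), "evaluation cases")
```

Scoring each case separately, rather than averaging over the whole grid, is what would let a drop at mid-context positions show up instead of being hidden in an aggregate number.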