This paper identifies the 'Inattentional Gap' where task-conditioned AI models suppress reporting of safety-critical signals they can otherwise detect, analogous to human inattentional blindness, challenging the assumption that benchmark performance ensures real-world safety.
This paper introduces CF-World, a counterfactual benchmark to evaluate whether text-to-image models rely on causal reasoning or mere pattern matching. Experiments show all models degrade sharply in counterfactual settings, suggesting their understanding is limited to tightly coupled visual-textual patterns rather than genuine causal reasoning.
This paper introduces regime-stratified evaluation for time series foundation models, revealing that aggregate metrics hide severe failures during traffic regime transitions, and proposes bimodal mixture augmentation to improve coverage while preserving overall accuracy.
Socratic-SWE introduces a closed-loop self-evolution framework for software engineering agents that leverages historical solving traces to generate targeted repair tasks, achieving 50.40% on SWE-bench Verified after three iterations.
FormInv proposes a measurement protocol for evaluating semantic invariance in mathematical reasoning benchmarks, revealing that model rankings reverse across paraphrase families and that standard accuracy metrics conceal large gaps in semantic consistency.
The paper introduces NEI-CAP, a diagnostic protocol to evaluate how 'Not Enough Information' examples are constructed in fact verification benchmarks, revealing that models trained on shortcut-prone NEI constructions fail to transfer to harder, semantically related insufficient evidence cases.
SkillsVote is a governance framework for long-horizon LLM agents that manages reusable skills through structured collection, recommendation, and evolution, improving performance on Terminal-Bench 2.0 and SWE-Bench Pro without model updates.
Epoch used GPT-5.5 to identify fatal errors in approximately one-third of the FrontierMath benchmark problems, demonstrating the model's capability to sanity-check evaluation standards.
This paper presents a retrospective analysis of the CODS 2025 AssetOpsBench challenge, evaluating multi-agent AI systems on industrial tasks. It highlights discrepancies between public and hidden leaderboards and offers diagnostics for future agentic benchmarks.
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.
A new paper proposes training a small 7B model via reinforcement learning to act as a task scheduler that automatically decomposes tasks into subtasks and assigns them to top models such as GPT-5 and Claude. The resulting system surpasses individual frontier models on several hard benchmarks, demonstrating that end-to-end reward learning can effectively replace manual prompt engineering and multi-agent pipeline design.
An academic study shows that LLM agents frequently discover complete solutions in their environments but almost never use them, revealing a missing "environmental curiosity" capability that is critical for open-ended tasks.