Epoch used GPT-5.5 to identify fatal errors in approximately one-third of the FrontierMath benchmark problems, demonstrating that frontier models can now sanity-check the quality of the benchmarks used to evaluate them.
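A minimal sketch of this kind of model-assisted benchmark audit, assuming an OpenAI-style chat API; the model name, prompt wording, and VERDICT format are illustrative and not Epoch's actual pipeline:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

AUDIT_PROMPT = (
    "You are auditing a math benchmark problem for fatal errors: ambiguous "
    "statements, wrong reference answers, or unsolvable setups. Reply with "
    "'VERDICT: OK' or 'VERDICT: FATAL', followed by a one-line reason."
)

def audit_problem(statement: str, reference_answer: str, model: str = "gpt-5.5") -> str:
    """Ask the model to sanity-check one benchmark item; returns its verdict text."""
    response = client.chat.completions.create(
        model=model,  # model name is illustrative, not Epoch's setup
        messages=[
            {"role": "system", "content": AUDIT_PROMPT},
            {"role": "user", "content": f"Problem:\n{statement}\n\nReference answer:\n{reference_answer}"},
        ],
    )
    return response.choices[0].message.content

# Flag the subset of items the auditor considers fatally flawed.
problems = [{"statement": "...", "answer": "..."}]  # placeholder items
flagged = [p for p in problems if "FATAL" in audit_problem(p["statement"], p["answer"])]
```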
This paper presents a retrospective analysis of the CODS 2025 AssetOpsBench challenge, evaluating multi-agent AI systems on industrial tasks. It highlights discrepancies between public and hidden leaderboards and offers diagnostics for future agentic benchmarks.
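One simple diagnostic of that kind of public/hidden discrepancy is the rank correlation between the two leaderboards; a sketch with invented team names and scores (not data from the challenge):

```python
from scipy.stats import spearmanr

# Invented scores per team on the public and hidden evaluation splits.
public = {"team_a": 0.81, "team_b": 0.78, "team_c": 0.74, "team_d": 0.70}
hidden = {"team_a": 0.62, "team_b": 0.71, "team_c": 0.69, "team_d": 0.55}

teams = sorted(public)
rho, p_value = spearmanr([public[t] for t in teams], [hidden[t] for t in teams])
print(f"public-vs-hidden rank correlation: rho={rho:.2f} (p={p_value:.2f})")
# A low rho means public-leaderboard rankings did not transfer to the hidden
# set, i.e. teams may have overfit to the public split.
```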
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination-detection datasets with an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves labeling agreement, and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.
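A sketch of what an LLM-first, human-adjudicated pass could look like, again assuming an OpenAI-style API; the field names (`source`, `response`, `gold_label`), prompt, and model name are assumptions, not the paper's protocol:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def relabel(item: dict, model: str = "gpt-5") -> dict:
    """LLM-first pass over one hallucination-detection example: agreements with
    the original gold label are accepted, disagreements are queued for a human
    adjudicator alongside the model's rationale."""
    prompt = (
        "Does the RESPONSE contain a hallucination relative to the SOURCE?\n"
        "First line: YES or NO. Second line: a one-sentence rationale.\n\n"
        f"SOURCE:\n{item['source']}\n\nRESPONSE:\n{item['response']}"
    )
    reply = client.chat.completions.create(
        model=model,  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    verdict, _, rationale = reply.partition("\n")
    item["llm_label"] = "hallucination" if verdict.strip().upper().startswith("YES") else "ok"
    item["llm_rationale"] = rationale.strip()
    item["needs_human"] = item["llm_label"] != item["gold_label"]  # humans only see disputes
    return item
```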
A new paper proposes training a small 7B model via reinforcement learning to act as a task scheduler, automatically decomposing tasks into subtasks and assigning them to frontier models such as GPT-5 and Claude. The system surpasses the individual frontier models on several hard benchmarks, suggesting that end-to-end reward learning can replace manual prompt engineering and hand-designed multi-agent pipelines.
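A toy sketch of the inference-time routing loop such a scheduler implies; in the paper the scheduling policy is a 7B model trained with an end-to-end task-success reward, whereas here both decomposition and routing are stubbed with naive placeholders:

```python
import random

WORKERS = ["gpt-5", "claude"]  # frontier worker pool named in the summary

def schedule(task: str) -> list[tuple[str, str]]:
    """Stand-in for the RL-trained 7B scheduler: split the task into subtasks
    and pick a worker for each. The paper's policy learns both steps from
    reward; here they are a naive split and a random choice."""
    subtasks = [s.strip() for s in task.split(";") if s.strip()]
    return [(sub, random.choice(WORKERS)) for sub in subtasks]

def run(task: str) -> list[str]:
    results = []
    for subtask, worker in schedule(task):
        # A real system would call the chosen model's API here.
        results.append(f"[{worker}] would handle: {subtask}")
    return results

print(run("derive the bound; verify it numerically; write up the proof"))
```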
An academic study shows that LLM agents frequently discover complete solutions in their environments but almost never use them, revealing a missing "environmental curiosity" capability that is critical for open-ended tasks.