benchmark-evaluation

#benchmark-evaluation

GPT-5.5 was used to flag fatal errors in FrontierMath problems

Reddit r/singularity · yesterday

Epoch used GPT-5.5 to identify fatal errors in approximately one-third of the FrontierMath benchmark problems, demonstrating that the model can sanity-check the problems used in evaluation benchmarks.

#benchmark-evaluation

Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

arXiv cs.AI · yesterday

This paper presents a retrospective analysis of the CODS 2025 AssetOpsBench challenge, evaluating multi-agent AI systems on industrial tasks. It highlights discrepancies between public and hidden leaderboards and offers diagnostics for future agentic benchmarks.

#benchmark-evaluation

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

arXiv cs.CL · yesterday

This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.

#benchmark-evaluation

@berryxia: Small model, big wisdom? It's now real! A 7B small model now acts as the boss of top large models like GPT-5, Claude Sonnet 4, Gemini 2.5 Pro. A new paper shows an RL-trained 7B model learned to write natural language subtasks, assign them to different models, precisely...

X AI KOLs Timeline · 2d ago

A new paper proposes training a small 7B model via reinforcement learning to act as a task scheduler that automatically decomposes work into subtasks and assigns them to top models such as GPT-5 and Claude. The resulting system surpasses individual frontier models on several hard benchmarks, suggesting that end-to-end reward learning can effectively replace manual prompt engineering and multi-agent pipeline design.

#benchmark-evaluation

Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity

Hugging Face Daily Papers · 2026-04-19

An academic study shows that LLM agents frequently discover complete solutions in their environments but almost never use them, revealing a missing "environmental curiosity" capability that is critical for open-ended tasks.
