benchmark-evaluation

Tag

#benchmark-evaluation

The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report

arXiv cs.CL · 2d ago

This paper identifies the 'Inattentional Gap': task-conditioned language and vision models omit safety-critical signals that they can otherwise report, a failure analogous to human inattentional blindness. The finding challenges the assumption that benchmark performance ensures real-world safety.

#benchmark-evaluation

Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Hugging Face Daily Papers · 5d ago

This paper introduces CF-World, a counterfactual benchmark to evaluate whether text-to-image models rely on causal reasoning or mere pattern matching. Experiments show all models degrade sharply in counterfactual settings, suggesting their understanding is limited to tightly coupled visual-textual patterns rather than genuine causal reasoning.

#benchmark-evaluation

Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting

arXiv cs.LG · 2026-06-18

This paper introduces regime-stratified evaluation for time series foundation models, revealing that aggregate metrics hide severe failures during traffic regime transitions, and proposes bimodal mixture augmentation to improve coverage while preserving overall accuracy.
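
As a rough illustration of why stratifying by regime matters, the sketch below compares an aggregate error with per-regime errors. It is not the paper's method: the regime labels, the toy speed values, the function names, and the use of mean absolute error are all assumptions made for the example.

```python
from collections import defaultdict


def mae(pairs):
    """Mean absolute error over (prediction, target) pairs."""
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


def regime_stratified_mae(records):
    """Return (aggregate MAE, per-regime MAE) for (regime, prediction, target) records."""
    by_regime = defaultdict(list)
    for regime, pred, target in records:
        by_regime[regime].append((pred, target))
    aggregate = mae([(p, t) for _, p, t in records])
    per_regime = {name: mae(pairs) for name, pairs in by_regime.items()}
    return aggregate, per_regime


# Hypothetical toy data: nine free-flow steps with a small error and
# one regime-transition step with a large error.
records = [("free_flow", 60.0, 61.0)] * 9 + [("transition", 55.0, 30.0)]

aggregate, per_regime = regime_stratified_mae(records)
print(aggregate)    # 3.4
print(per_regime)   # {'free_flow': 1.0, 'transition': 25.0}
```

In this toy case the aggregate error looks modest because the rare transition regime contributes only one of ten samples, while its own error is far larger than the free-flow error.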

#benchmark-evaluation

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

Hugging Face Daily Papers · 2026-06-05

Socratic-SWE introduces a closed-loop self-evolution framework for software engineering agents that leverages historical solving traces to generate targeted repair tasks, achieving 50.40% on SWE-bench Verified after three iterations.

#benchmark-evaluation

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

arXiv cs.LG · 2026-05-29

FormInv proposes a measurement protocol for evaluating semantic invariance in mathematical reasoning benchmarks, revealing that model rankings reverse across paraphrase families and that standard accuracy metrics conceal large gaps in semantic consistency.
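
To make the gap between accuracy and semantic consistency concrete, here is a minimal sketch. It does not reproduce the FormInv protocol: the two models, the per-paraphrase results, and the "correct under every paraphrase" consistency measure are hypothetical choices for illustration.

```python
def accuracy(results):
    """Fraction of correct answers over every (problem, paraphrase) pair."""
    flat = [ok for variants in results.values() for ok in variants]
    return sum(flat) / len(flat)


def all_variant_consistency(results):
    """Fraction of problems answered correctly under every paraphrase."""
    return sum(all(variants) for variants in results.values()) / len(results)


# Hypothetical results: each problem maps to correctness on three paraphrases.
model_a = {
    "p1": [True, True, True],
    "p2": [True, True, True],
    "p3": [False, False, False],
    "p4": [False, False, False],
}
model_b = {
    "p1": [True, True, False],
    "p2": [True, False, True],
    "p3": [False, True, True],
    "p4": [True, True, False],
}

for name, results in [("model_a", model_a), ("model_b", model_b)]:
    print(name, accuracy(results), all_variant_consistency(results))
```

With these made-up numbers, model_b has the higher plain accuracy (8 of 12 answers against 6 of 12), yet it never solves a problem under all three paraphrases, while model_a does so for half of the problems. The ordering of the two models therefore depends on which metric is reported.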

#benchmark-evaluation

Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification

arXiv cs.CL · 2026-05-27

The paper introduces NEI-CAP, a diagnostic protocol to evaluate how 'Not Enough Information' examples are constructed in fact verification benchmarks, revealing that models trained on shortcut-prone NEI constructions fail to transfer to harder, semantically related insufficient evidence cases.

#benchmark-evaluation

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

Hugging Face Daily Papers · 2026-05-18

SkillsVote is a governance framework for long-horizon LLM agents that manages reusable skills through structured collection, recommendation, and evolution, improving performance on Terminal-Bench 2.0 and SWE-Bench Pro without model updates.

#benchmark-evaluation

GPT-5.5 was used to flag fatal errors in FrontierMath problems

Reddit r/singularity · 2026-05-12

Epoch used GPT-5.5 to identify fatal errors in approximately one-third of the FrontierMath benchmark problems, demonstrating the model's capability to sanity-check evaluation standards.

#benchmark-evaluation

Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

arXiv cs.AI · 2026-05-12

This paper presents a retrospective analysis of the CODS 2025 AssetOpsBench challenge, evaluating multi-agent AI systems on industrial tasks. It highlights discrepancies between public and hidden leaderboards and offers diagnostics for future agentic benchmarks.

#benchmark-evaluation

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

arXiv cs.CL · 2026-05-12

This paper re-evaluates hallucination detection datasets with an LLM-first, human-adjudicated assessment method to test whether standard benchmarks underestimate LLM performance. It finds that incorporating LLM reasoning into the adjudication process improves agreement, and it suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.

#benchmark-evaluation

@berryxia: Small model, big wisdom? It's now real! A 7B small model now acts as the boss of top large models like GPT-5, Claude Sonnet 4, Gemini 2.5 Pro. A new paper shows an RL-trained 7B model learned to write natural language subtasks, assign them to different models, precisely...

X AI KOLs Timeline · 2026-05-11

A new paper proposes training a small 7B model with reinforcement learning to act as a task scheduler that automatically decomposes tasks into subtasks and assigns them to top models such as GPT-5 and Claude. The scheduler surpasses individual frontier models on several hard benchmarks, demonstrating that end-to-end reward learning can effectively replace manual prompt engineering and multi-agent pipeline design.

#benchmark-evaluation

Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity

Hugging Face Daily Papers · 2026-04-19

This study shows that LLM agents frequently discover complete solutions in their environments but almost never use them, revealing a missing 'environmental curiosity' capability that is critical for open-ended tasks.
