benchmark-construction

Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

arXiv cs.AI · 2026-06-11

This paper proposes Embodied-BenchClaw, an autonomous multi-agent system that constructs embodied spatial intelligence benchmarks from user intent through a five-stage pipeline with process quality control and an extensible Skill Library.

Benchmark Everything Everywhere All at Once

Hugging Face Daily Papers · 2026-06-04

Introduces Benchmark Agent, a fully autonomous system for creating diverse benchmarks with minimal human intervention, enabling continuous model assessment across domains.

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

arXiv cs.AI · 2026-05-22

This paper presents QuestBench, a benchmark built by students to evaluate deep research systems across humanities and social science domains. Results show that even advanced systems like GPT-5.5 pass only 57.58% of questions, highlighting failures in trustworthiness.
