AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
Summary
AJ-Bench introduces a benchmark to evaluate Agent-as-a-Judge systems that interact with environments to verify agent behaviors across 155 tasks in search, data systems, and GUI domains.
Cached at: 04/22/26, 10:35 AM
Source: https://huggingface.co/papers/2604.18240
Abstract
The AJ-Bench benchmark evaluates the automated verification capabilities of Agent-as-a-Judge systems across multiple domains.
As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce a benchmark, AJ-Bench, to systematically evaluate Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents’ abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.
Get this paper in your agent:
hf papers read 2604.18240
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents
Proposes Online Agent-as-a-Judge, an evaluation framework that uses an in-world evaluator agent to actively generate situations for testing interactive social agents, improving coverage and reliability over passive methods.
JobBench: Aligning Agent Work With Human Will
JobBench is a benchmark built from worker surveys to evaluate AI agents on tasks that workers most want automated, covering 130 tasks across 35 professions with detailed rubrics.
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
This paper introduces Agent-ValueBench, a comprehensive benchmark designed to evaluate the values of autonomous agents, revealing that agent values diverge from their underlying language models.
Benchmark Everything Everywhere All at Once
Introduces Benchmark Agent, a fully autonomous system for creating diverse benchmarks with minimal human intervention, enabling continuous model assessment across domains.
EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions
EnterpriseClawBench presents a benchmark for enterprise agents based on real-world workplace sessions, offering 852 reproducible tasks and comprehensive evaluation metrics beyond single performance scores.