AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
Summary
AJ-Bench is a benchmark for evaluating Agent-as-a-Judge systems, which interact with environments to verify agent behaviors, spanning 155 tasks across search, data-system, and GUI domains.
Source: https://huggingface.co/papers/2604.18240
Abstract
AJ-Bench evaluates the automated verification capabilities of judge agents across search, data-system, and GUI domains.
As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark to systematically evaluate Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.
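The core idea in the abstract (a judge agent that actively queries the environment to gather evidence, then checks both the final state and the agent's recorded process) can be illustrated with a small sketch. The snippet below is a hypothetical toy, assuming a dictionary-backed environment and simple equality checks; it is not the AJ-Bench harness or its API.

```python
# Toy Agent-as-a-Judge loop: verify a trajectory by querying the environment
# directly instead of judging the transcript alone (LLM-as-a-Judge).
from dataclasses import dataclass, field


@dataclass
class Verdict:
    state_ok: bool          # final environment state satisfies the task goal
    process_ok: bool        # recorded steps are consistent with the evidence
    evidence: dict = field(default_factory=dict)


def judge(task_goal: dict, trajectory: list[dict], env: dict) -> Verdict:
    # 1. Information acquisition: read the environment keys the goal mentions.
    evidence = {key: env.get(key) for key in task_goal}

    # 2. State verification: compare observed values against the goal.
    state_ok = all(evidence.get(k) == v for k, v in task_goal.items())

    # 3. Process verification: every write the agent claims must match what
    #    the environment actually contains (no fabricated actions).
    process_ok = all(
        env.get(step["key"]) == step["value"]
        for step in trajectory
        if step.get("action") == "write"
    )
    return Verdict(state_ok, process_ok, evidence)


# The agent claims it set status="done", but the environment says otherwise.
env = {"status": "pending"}
trajectory = [{"action": "write", "key": "status", "value": "done"}]
print(judge({"status": "done"}, trajectory, env))
# Verdict(state_ok=False, process_ok=False, evidence={'status': 'pending'})
```

In the benchmark's actual domains, the environment probes would be real tool calls (issuing a search query, running a database statement, or inspecting a GUI state) and the state and process checks correspondingly richer.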
Get this paper in your agent:
hf papers read 2604.18240
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
ProgramBench (5 minute read)
ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.
Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems
This paper introduces Partial-Evidence-Bench, a deterministic benchmark for measuring 'authorization-limited evidence' failures in agentic AI systems. It evaluates how models handle tasks where access control restricts visibility, assessing their ability to recognize and report incomplete information rather than silently producing answers that appear complete but are not.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
OpenAI introduces MLE-bench, a benchmark of 75 Kaggle ML competitions to evaluate AI agents on real-world ML engineering tasks. The best setup, o1-preview with AIDE scaffolding, achieves at least a Kaggle bronze medal in 16.9% of competitions.
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
This paper introduces SWE-WebDevBench, a comprehensive 68-metric framework for evaluating AI-powered application development platforms as virtual software agencies. The study highlights critical gaps in current platforms regarding specification understanding, backend reliability, production readiness, and security.
I built a benchmark for AI “memory” in coding agents. Looking for others to beat it.
A developer created a new benchmark, continuity-benchmarks, to test AI coding agents' ability to stay consistent with project rules during active development. It addresses gaps in existing memory benchmarks, which focus on semantic recall rather than real-time architectural consistency and multi-session behavior.