agent-as-judge

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Hugging Face Daily Papers · 2026-04-20

AJ-Bench is a benchmark for evaluating Agent-as-a-Judge systems, which interact with environments to verify agent behavior. It covers 155 tasks across the search, data systems, and GUI domains.
