agent-as-judge

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Hugging Face Daily Papers · 2026-04-20

AJ-Bench is a benchmark for evaluating Agent-as-a-Judge systems, which interact with environments to verify agent behavior. It covers 155 tasks across the search, data systems, and GUI domains.
