Tag
Proposes Online Agent-as-a-Judge, an evaluation framework that uses an in-world evaluator agent to actively generate situations for testing interactive social agents, improving coverage and reliability over passive methods.
Proposes the Map-then-Act Paradigm (MAP), a plug-and-play framework that moves environmental understanding ahead of execution in interactive LLM agents, achieving consistent gains across benchmarks and enabling frontier models to surpass near-zero baseline performance in 22 of 25 game environments.