Tag
This article introduces TeamBench, a benchmark for evaluating agent coordination under enforced role separation, addressing issues where prompt-only roles may bypass intended constraints.
This paper introduces the DecodingTrust-Agent Platform (DTap), a controllable and interactive red-teaming platform for evaluating AI agent security across multiple domains. It also presents DTap-Red, an autonomous agent for discovering attack strategies, and DTap-Bench, a large-scale dataset for risk assessment.
This paper introduces SWE-WebDevBench, a comprehensive 68-metric framework for evaluating AI-powered application development platforms as virtual software agencies. The study highlights critical gaps in current platforms regarding specification understanding, backend reliability, production readiness, and security.
OpenGame is an open-source agentic framework for end-to-end web game creation, powered by the specialized GameCoder-27B model and evaluated via the new OpenGame-Bench benchmark.
OpenAI introduces GDPval, a new evaluation framework measuring AI model performance on economically valuable, real-world tasks across 44 occupations in the top 9 US GDP-contributing industries. The benchmark includes 1,320 specialized tasks based on actual professional work products, representing a progression from academic benchmarks to more realistic occupational assessments.