evaluation-benchmark

Tag · Cards List
#evaluation-benchmark

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

arXiv cs.AI · 2d ago

This article introduces TeamBench, a benchmark for evaluating agent coordination under enforced role separation, addressing the problem that prompt-only role assignments can be bypassed, violating the intended constraints.


DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Hugging Face Daily Papers · 2026-05-06

This paper introduces the DecodingTrust-Agent Platform (DTap), a controllable and interactive red-teaming platform for evaluating AI agent security across multiple domains. It also presents DTap-Red, an autonomous agent for discovering attack strategies, and DTap-Bench, a large-scale dataset for risk assessment.


SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

Hugging Face Daily Papers · 2026-05-06

This paper introduces SWE-WebDevBench, a comprehensive 68-metric framework for evaluating AI-powered application development platforms as virtual software agencies. The study highlights critical gaps in current platforms regarding specification understanding, backend reliability, production readiness, and security.


OpenGame: Open Agentic Coding for Games

Papers with Code Trending · 2026-04-20

OpenGame is an open-source agentic framework for end-to-end web game creation, powered by the specialized GameCoder-27B model and evaluated via the new OpenGame-Bench benchmark.


Measuring the performance of our models on real-world tasks

OpenAI Blog · 2025-09-25

OpenAI introduces GDPval, a new evaluation framework measuring AI model performance on economically valuable, real-world tasks spanning 44 occupations across the top 9 industries contributing to US GDP. The benchmark comprises 1,320 specialized tasks grounded in actual professional work products, marking a shift from academic benchmarks toward more realistic occupational assessments.
