AgenticDataBench: A Comprehensive Benchmark for Data Agents
Summary
Introduces AgenticDataBench, a comprehensive benchmark for evaluating LLM-based data agents across diverse domains with fine-grained skill-based metrics, including real-world B2B use cases and synthetic tasks.
Cached at: 07/03/26, 03:52 AM
Source: https://huggingface.co/papers/2607.01647
Abstract
A comprehensive benchmark named AgenticDataBench is introduced to evaluate data agents across diverse domains with fine-grained task annotations and skill-based coverage metrics.
Data science aims to derive actionable insights from heterogeneous raw data, unlocking the value of the massive amounts of data generated in modern society. Automating this process is essential to reducing labor-intensive effort for data scientists and enabling scalable data-driven applications. Recently, large language model (LLM)-based data agents have emerged as a promising solution for automating data science workflows. However, the field lacks comprehensive benchmarks to rigorously evaluate these agents across diverse scenarios at a fine granularity. To address this gap, we propose AgenticDataBench, a comprehensive benchmark featuring realistic tasks spanning diverse domains with fine-grained ground-truth labels. This enables evaluations to capture the diversity and complexity of data science workflows and the detailed performance of agents. First, to cover diverse domains, we collect real datasets and tasks from 15 vertical domains, including 5 real-world B2B use cases from a leading fintech company. Second, to remove redundancy in real-world tasks and generate high-quality tasks for domains lacking real data, we introduce data science skills (recurring data-centric operational patterns) and quantify benchmark coverage by the number of skills included. Representative skills are extracted from large-scale task solutions on Stack Overflow using skill-aligned hierarchical clustering. Third, for real-world business tasks, we select task-solution pairs that maximize diversity in skill composition, ensuring broad coverage of practical scenarios. Fourth, to generate realistic tasks for diverse domains without real tasks, we propose a systematic LLM-based task generation approach to create workflows and tasks based on these skills. Finally, we evaluate state-of-the-art data agents using our annotated benchmark and open-sourced testbed, providing detailed skill-level insights.
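The abstract does not say how task-solution pairs are chosen to maximize skill diversity, so the following is only an illustrative sketch. It assumes each task is annotated with a set of skill labels and uses a greedy set-cover heuristic, which is a stand-in and not necessarily the authors' method. The function names, task IDs, and skill labels are all hypothetical.

```python
from typing import Dict, List, Set


def skill_coverage(selected: List[str],
                   task_skills: Dict[str, Set[str]],
                   all_skills: Set[str]) -> float:
    """Fraction of known skills exercised by at least one selected task."""
    covered: Set[str] = set()
    for task in selected:
        covered |= task_skills[task]
    return len(covered & all_skills) / len(all_skills)


def select_diverse_tasks(task_skills: Dict[str, Set[str]],
                         budget: int) -> List[str]:
    """Greedy heuristic (an assumption, not the paper's algorithm):
    repeatedly pick the task that adds the most not-yet-covered skills."""
    selected: List[str] = []
    covered: Set[str] = set()
    remaining = dict(task_skills)
    while remaining and len(selected) < budget:
        # Iterate in sorted order so ties resolve to the smallest task ID.
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # no remaining task contributes a new skill
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Hypothetical task annotations, for illustration only.
tasks = {
    "t1": {"join", "groupby"},
    "t2": {"groupby"},
    "t3": {"pivot", "datetime_parse", "join"},
    "t4": {"regex_clean"},
}
all_skills = set().union(*tasks.values())

chosen = select_diverse_tasks(tasks, budget=2)
print(chosen, skill_coverage(chosen, tasks, all_skills))
```

With a budget of two tasks, the sketch first picks the task with the largest skill set and then the one that adds a new skill, which mirrors the idea of quantifying coverage by the number of skills included.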
Datasets citing this paper: 1
shawnzzzh/AgenticDataBench
Similar Articles
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
This paper introduces Agent-ValueBench, a comprehensive benchmark designed to evaluate the values of autonomous agents, revealing that agent values diverge from their underlying language models.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
AJ-Bench introduces a benchmark to evaluate Agent-as-a-Judge systems that interact with environments to verify agent behaviors across 155 tasks in search, data systems, and GUI domains.
EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions
EnterpriseClawBench presents a benchmark for enterprise agents based on real-world workplace sessions, offering 852 reproducible tasks and comprehensive evaluation metrics beyond single performance scores.
EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
Introduces EComAgentBench, a benchmark for evaluating LLM-based shopping agents on long-horizon tasks with hidden intents distributed across queries, profiles, and clarifications. The benchmark uses real Amazon products and automated scoring, revealing that even the best model achieves only 57.1% accuracy.
HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents
This paper introduces HealthAgentBench, a suite of 54 realistic healthcare tasks for evaluating frontier AI agents. It finds that even the best agent (Codex GPT-5.5) achieves only ~42% success, highlighting substantial room for improvement.