AgenticDataBench: A Comprehensive Benchmark for Data Agents

Hugging Face Daily Papers

Summary

Introduces AgenticDataBench, a comprehensive benchmark for evaluating LLM-based data agents across diverse domains with fine-grained skill-based metrics, including real-world B2B use cases and synthetic tasks.

Data science aims to derive actionable insights from heterogeneous raw data, unlocking the value of the massive amounts of data generated in modern society. Automating this process is essential to reducing labor-intensive effort for data scientists and enabling scalable data-driven applications. Recently, large language model (LLM)-based data agents have emerged as a promising solution for automating data science workflows. However, the field lacks comprehensive benchmarks that rigorously evaluate these agents across diverse scenarios at fine granularity. To address this gap, we propose AgenticDataBench, a comprehensive benchmark featuring realistic tasks that span diverse domains and carry fine-grained ground-truth labels. This lets evaluations capture both the diversity and complexity of data science workflows and the detailed performance of agents. First, to cover diverse domains, we collect real datasets and tasks from 15 vertical domains, including 5 real-world B2B use cases from a leading fintech company. Second, to remove redundancy in real-world tasks and generate high-quality tasks for domains lacking real data, we introduce data science skills (recurring data-centric operational patterns) and quantify benchmark coverage by the number of skills included. Representative skills are extracted from large-scale task solutions on Stack Overflow using skill-aligned hierarchical clustering. Third, for real-world business tasks, we select task-solution pairs that maximize diversity in skill composition, ensuring broad coverage of practical scenarios. Fourth, to generate realistic tasks for diverse domains without real tasks, we propose a systematic LLM-based task generation approach that creates workflows and tasks based on these skills. Finally, we evaluate state-of-the-art data agents using our annotated benchmark and open-sourced testbed, providing detailed skill-level insights.
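The abstract does not spell out how task-solution pairs are chosen to maximize diversity in skill composition. One plausible reading is a greedy set-cover heuristic, sketched below. This is an illustration under stated assumptions, not the authors' procedure: the function names, the `budget` parameter, and the skill labels are all hypothetical, and each task is assumed to be annotated with a set of skill names.

```python
from typing import Dict, List, Set


def skill_coverage(selected: List[str], task_skills: Dict[str, Set[str]]) -> int:
    """Coverage as described in the abstract: the number of distinct skills included."""
    covered: Set[str] = set()
    for task_id in selected:
        covered |= task_skills[task_id]
    return len(covered)


def select_diverse_tasks(task_skills: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedily pick up to `budget` tasks, each time taking the task that adds
    the most skills not yet covered (an assumed heuristic, not the paper's)."""
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(task_skills)
    while remaining and len(selected) < budget:
        # Iterate ids in sorted order so ties are broken deterministically.
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # every remaining task is redundant with respect to skills
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Hypothetical skill annotations for four candidate tasks.
task_skills = {
    "t1": {"groupby_aggregate", "join"},
    "t2": {"join"},
    "t3": {"pivot", "date_parsing", "groupby_aggregate"},
    "t4": {"missing_value_imputation"},
}
chosen = select_diverse_tasks(task_skills, budget=3)
print(chosen, skill_coverage(chosen, task_skills))
```

In this toy input, `t2` is never selected because its only skill is already contributed by `t1`, which is the kind of redundancy removal the abstract describes.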

Source: https://huggingface.co/papers/2607.01647

Abstract

A comprehensive benchmark named AgenticDataBench is introduced to evaluate data agents across diverse domains with fine-grained task annotations and skill-based coverage metrics.
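To make the idea of skill-level evaluation concrete, the sketch below aggregates task outcomes into a per-skill pass rate. It is a simplified illustration, not the benchmark's actual metric: the function name and the input shape are hypothetical, and it assumes each task result is a pair of (annotated skill set, pass/fail), crediting or penalizing every skill in a task equally.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def skill_pass_rates(results: Iterable[Tuple[Set[str], bool]]) -> Dict[str, float]:
    """Fraction of tasks passed, computed separately for each annotated skill.

    Assumption: a task's outcome is attributed to all of its skills, which is
    coarser than step-level ground-truth labels would allow.
    """
    attempts: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for skills, passed in results:
        for skill in skills:
            attempts[skill] += 1
            passes[skill] += int(passed)
    return {skill: passes[skill] / attempts[skill] for skill in attempts}


# Hypothetical agent results on three annotated tasks.
results = [
    ({"join", "groupby_aggregate"}, True),
    ({"join"}, False),
    ({"pivot"}, True),
]
rates = skill_pass_rates(results)
print(rates)
```

A breakdown like this shows where an agent is weak (here, `join`) even when its overall task accuracy looks acceptable.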

Datasets citing this paper: 1

shawnzzzh/AgenticDataBench
