LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

long-horizon multi-turn data-analysis benchmark agent state-evolution kaggle

Summary

LongDS is a benchmark for evaluating AI agents on long-horizon, multi-turn data analysis tasks derived from Kaggle notebooks. In experiments, the best model reaches only about 48% average accuracy, and accuracy drops sharply from early to late turns.

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52% to 69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
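The state-evolution patterns named above (rollback, counterfactual perturbation, multi-state composition) can be illustrated with a toy session object. This is a minimal sketch, not the benchmark's implementation: the `AnalysisSession` class, its method names, and the price data are all hypothetical and chosen only to show what "maintain, update, restore, and compose" could mean in code.

```python
import copy


class AnalysisSession:
    """Hypothetical toy model of an evolving analytical state."""

    def __init__(self):
        self.state = {}       # live variables of the analysis
        self.snapshots = {}   # named copies of earlier states

    def update(self, **changes):
        # Maintain/update: apply a turn's changes to the live state.
        self.state.update(changes)

    def checkpoint(self, name):
        # Save a deep copy so later turns cannot mutate it.
        self.snapshots[name] = copy.deepcopy(self.state)

    def rollback(self, name):
        # Restore: replace the live state with an earlier snapshot.
        self.state = copy.deepcopy(self.snapshots[name])

    def counterfactual(self, name, **changes):
        # Perturb a copy of a snapshot without touching the live state.
        branch = copy.deepcopy(self.snapshots[name])
        branch.update(changes)
        return branch


def mean(values):
    return sum(values) / len(values)


session = AnalysisSession()

# Turn 1: load (made-up) data and remember the raw state.
session.update(prices=[10.0, 12.0, 50.0])
session.checkpoint("raw")

# Turn 2: drop an outlier and remember the filtered state.
session.update(prices=[p for p in session.state["prices"] if p < 40])
session.checkpoint("filtered")

# Turn 3: counterfactual perturbation ("what if raw prices doubled?").
doubled = [p * 2 for p in session.snapshots["raw"]["prices"]]
branch = session.counterfactual("raw", prices=doubled)

# Turn 4: rollback to the raw state.
session.rollback("raw")

# Turn 5: multi-state composition, comparing two earlier states.
comparison = {
    "raw_mean": mean(session.snapshots["raw"]["prices"]),
    "filtered_mean": mean(session.snapshots["filtered"]["prices"]),
}
```

An agent answering turn 5 correctly has to remember both earlier states and that turn 3 did not alter either of them, which is the kind of bookkeeping the benchmark is described as testing.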

Original Article

Cached at: 06/01/26, 03:18 AM

Source: https://huggingface.co/papers/2605.30434

Abstract

The LongDS benchmark evaluates agents’ ability to maintain and update analytical states over extended data analysis sessions, using real-world tasks built from Kaggle notebooks.

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents’ ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52% to 69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
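The abstract reports an average dependency span of 11.3 turns but does not define the metric here. One plausible reading, used purely as an assumption in the sketch below, is the distance between a turn and the earliest prior turn it depends on, averaged over turns that have dependencies. The function name and the toy dependency map are hypothetical and are not taken from the benchmark.

```python
def average_dependency_span(dependencies):
    """Average distance from each turn to its earliest dependency.

    `dependencies` maps a turn index to the list of earlier turn
    indices it relies on. This definition is an assumption; the
    benchmark may compute the span differently.
    """
    spans = [
        turn - min(deps)
        for turn, deps in dependencies.items()
        if deps  # turns with no dependencies contribute no span
    ]
    return sum(spans) / len(spans) if spans else 0.0


# Made-up example: turn 12 reaches back 8 turns to turn 4.
toy_dependencies = {1: [], 2: [1], 5: [2, 4], 12: [4]}
toy_span = average_dependency_span(toy_dependencies)
```

Under this reading, a larger span means an agent has to keep an earlier result correct and retrievable across more intervening turns before it is needed again.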
