LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Summary

LongDS is a benchmark for evaluating AI agents on long-horizon, multi-turn data analysis tasks derived from Kaggle notebooks. In the reported experiments, the best model reaches only about 48% average accuracy, and performance drops by nearly 47 points from early to late turns.

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52% to 69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
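
The state-evolution patterns named in the abstract can be made concrete with a small sketch. The Python below is an illustration only, not code from LongDS: the `AnalysisSession` class, the `ROWS` data, and the turn sequence are all hypothetical. It shows what maintaining, updating, restoring, and composing an analytical state might look like across five turns.

```python
from copy import deepcopy
from statistics import mean

# Hypothetical toy data: (region, revenue) rows. Not taken from LongDS.
ROWS = [("north", 120.0), ("south", 80.0), ("north", 100.0), ("east", 60.0)]


class AnalysisSession:
    """Toy analytical state: a filter/scale config plus named snapshots."""

    def __init__(self):
        self.state = {"regions": {"north", "south", "east"}, "scale": 1.0}
        self.snapshots = {}

    def save(self, name):
        # Snapshot the current state so a later turn can restore it.
        self.snapshots[name] = deepcopy(self.state)

    def rollback(self, name):
        # Restore an earlier state, discarding changes made since then.
        self.state = deepcopy(self.snapshots[name])

    def mean_revenue(self, state=None):
        # Evaluate the metric under the current state or a saved one.
        s = self.state if state is None else state
        values = [rev * s["scale"] for region, rev in ROWS if region in s["regions"]]
        return mean(values)


session = AnalysisSession()
session.save("baseline")                   # turn 1: establish the state
session.state["regions"].discard("east")   # turn 2: update (drop a region)
session.save("no_east")
session.state["scale"] = 1.1               # turn 3: counterfactual perturbation
counterfactual = session.mean_revenue()
session.rollback("no_east")                # turn 4: rollback to the turn-2 state
restored = session.mean_revenue()
# turn 5: multi-state composition, comparing two saved states
gap = restored - session.mean_revenue(session.snapshots["baseline"])
```

In this sketch, the answer at turn 5 is correct only if the turn-3 perturbation was fully undone at turn 4 and the turn-1 baseline is still intact. That is the kind of cross-turn dependency the abstract describes.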
Cached at: 06/01/26, 03:18 AM

Source: https://huggingface.co/papers/2605.30434

Abstract

LongDS benchmark evaluates agents’ ability to maintain and update analytical states over extended data analysis sessions using real-world tasks from Kaggle notebooks.


