LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
Summary
LongDS is a benchmark for evaluating AI agents on long-horizon, multi-turn data analysis tasks derived from Kaggle notebooks. In the reported experiments, the best model reaches only 48.45% average accuracy, and accuracy drops by nearly 47 points from early to late turns.
Cached at: 06/01/26, 03:18 AM
Source: https://huggingface.co/papers/2605.30434
Abstract
LongDS benchmark evaluates agents’ ability to maintain and update analytical states over extended data analysis sessions using real-world tasks from Kaggle notebooks.
Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents’ ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52% to 69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
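The state-evolution patterns named in the abstract can be illustrated with a small toy example. The sketch below is not taken from LongDS or the DataMind repository; the `AnalyticalState` class, its method names, and the sales rows are all hypothetical, chosen only to show what rollback, counterfactual perturbation, and multi-state composition might look like in a multi-turn session.

```python
from copy import deepcopy


class AnalyticalState:
    """Hypothetical container for a session's evolving analysis state."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.filters = []      # names of filters applied so far
        self._history = {}     # turn number -> snapshot of (rows, filters)

    def checkpoint(self, turn):
        # Save a snapshot so a later turn can restore this state.
        self._history[turn] = (deepcopy(self.rows), list(self.filters))

    def apply_filter(self, name, predicate):
        # Update the live state: later turns now depend on this filter.
        self.rows = [r for r in self.rows if predicate(r)]
        self.filters.append(name)

    def rollback(self, turn):
        # Restore the state exactly as it was at an earlier turn.
        rows, filters = self._history[turn]
        self.rows, self.filters = deepcopy(rows), list(filters)

    def counterfactual(self, transform):
        # Evaluate a what-if on copies; the live state is left untouched.
        return [transform(dict(r)) for r in self.rows]

    def mean(self, key, rows=None):
        rows = self.rows if rows is None else rows
        return sum(r[key] for r in rows) / len(rows)


# Made-up data for illustration only.
sales = [
    {"region": "north", "revenue": 100.0},
    {"region": "south", "revenue": 300.0},
    {"region": "north", "revenue": 200.0},
]

state = AnalyticalState(sales)
state.checkpoint(turn=1)                                   # turn 1: full data
state.apply_filter("north_only", lambda r: r["region"] == "north")
north_mean = state.mean("revenue")                         # turn 2: filtered view

# Counterfactual perturbation: "what if revenue had doubled?"
what_if = state.counterfactual(lambda r: {**r, "revenue": r["revenue"] * 2})
what_if_mean = state.mean("revenue", what_if)
mean_after_what_if = state.mean("revenue")                 # live state unchanged

# Rollback: return to the turn-1 state, discarding the filter.
state.rollback(1)
full_mean = state.mean("revenue")

# Multi-state composition: combine results from two different states.
gap = north_mean - full_mean
```

An agent that forgets the rollback and keeps answering against the filtered rows would report the filtered mean for the final question instead of the full-data mean, which is the kind of stale-state mistake the abstract attributes to long-horizon errors.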