LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Abstract

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving untested an agent's ability to track evolving analytical context over long horizons. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis in which agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains, including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, that performance drops by nearly 47 percentage points from early to late turns, and that long-horizon errors account for 52% to 69% of failures. Further analysis shows that giving the agent more interaction steps does not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than the size of the interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
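To make the state-evolution patterns named above concrete, the following is a minimal illustrative sketch of a multi-turn session that updates, perturbs, rolls back, and composes an analytical state. It is not LongDS code: the `AnalysisState` class, the `mean_price` helper, and the toy table are all hypothetical, and the mapping of turns to patterns is an assumption made for illustration only.

```python
import copy
from statistics import mean


class AnalysisState:
    """Hypothetical container for an evolving analysis state (not LongDS code)."""

    def __init__(self, rows):
        self.rows = rows          # current working table: a list of dicts
        self._snapshots = {}      # turn index -> deep copy of the table

    def checkpoint(self, turn):
        # Save the state as it stands at the end of a turn.
        self._snapshots[turn] = copy.deepcopy(self.rows)

    def update(self, fn):
        # Apply a transformation that later turns will depend on.
        self.rows = fn(self.rows)

    def rollback(self, turn):
        # Restore the state saved at an earlier turn.
        self.rows = copy.deepcopy(self._snapshots[turn])

    def snapshot(self, turn):
        # Read an earlier state without changing the current one.
        return copy.deepcopy(self._snapshots[turn])


def mean_price(rows):
    # Hypothetical analysis step: average of the "price" column.
    return mean(r["price"] for r in rows)


# Turn 1: load a toy table (made-up data) and checkpoint it.
state = AnalysisState([
    {"region": "north", "price": 10.0},
    {"region": "south", "price": 20.0},
    {"region": "south", "price": 60.0},
])
state.checkpoint(turn=1)

# Turn 2 (update): keep only the south region.
state.update(lambda rows: [r for r in rows if r["region"] == "south"])
state.checkpoint(turn=2)
south_mean = mean_price(state.rows)

# Turn 3 (counterfactual perturbation): what if prices were 10% higher?
# Computed on a copy so the real state is left untouched.
what_if = [{**r, "price": r["price"] * 1.1} for r in state.snapshot(2)]
counterfactual_mean = mean_price(what_if)

# Turn 4 (rollback): undo the turn-2 filter.
state.rollback(turn=1)
overall_mean = mean_price(state.rows)

# Turn 5 (multi-state composition): combine results from two earlier states.
gap = south_mean - overall_mean
```

In this sketch, the answer at turn 5 is only correct if the filter from turn 2, the untouched state after turn 3, and the restored state from turn 4 are all tracked correctly, which is the kind of cross-turn dependency the benchmark is described as targeting.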