LongDS-Bench: Investigating Failures in Long-Horizon Autonomous Data Analysis

Abstract

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks and therefore do not test whether agents can track an evolving analytical context over long horizons. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis in which agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks built from real-world Kaggle notebooks, totaling 2,225 turns across six domains, including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, and multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, that performance drops by nearly 47 percentage points from early to late turns, and that long-horizon errors account for 52% to 69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing the interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.