LongDS-Bench: On the Failures of Long-Horizon Agentic Data Analysis

Abstract

Real-world data analysis is inherently iterative, yet most existing benchmarks evaluate only isolated or short interactive tasks, leaving untested an agent's ability to track an evolving analytical context over long horizons. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis in which agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains, including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, that performance drops by nearly 47 points from early to late turns, and that long-horizon errors account for 52% to 69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than a limited interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.