LongDS-Bench: On the Failures of Long-Horizon Agentic Data Analysis

Abstract

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving untested an agent's ability to track an evolving analytical context over long horizons. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis in which agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains, including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, that performance drops by nearly 47 points from early to late turns, and that long-horizon errors account for 52% to 69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing the interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
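
To make the four state operations named in the abstract (maintain, update, restore, compose) concrete, the following is a minimal illustrative sketch in Python. It is not the LongDS implementation or evaluation harness: the class `AnalysisState`, its methods, and the variable names (`threshold`, `row_filter`) are hypothetical, chosen only to show what tracking an evolving analytical state across turns could look like.

```python
from copy import deepcopy


class AnalysisState:
    """Toy analytical state (hypothetical): named settings plus per-turn snapshots."""

    def __init__(self):
        self.vars = {}      # current analytical state
        self.history = {}   # turn index -> snapshot of the state after that turn

    def update(self, turn, **changes):
        """Apply the changes requested at a turn and record a snapshot."""
        self.vars.update(changes)
        self.history[turn] = deepcopy(self.vars)

    def rollback(self, turn):
        """Restore the state exactly as it was after an earlier turn."""
        self.vars = deepcopy(self.history[turn])

    def compose(self, turn_a, turn_b):
        """Merge two recorded states; values from turn_b win on conflicts."""
        merged = deepcopy(self.history[turn_a])
        merged.update(deepcopy(self.history[turn_b]))
        return merged


state = AnalysisState()
state.update(1, threshold=0.5, row_filter="year >= 2010")
state.update(2, threshold=0.8)      # a later turn perturbs the threshold
state.rollback(1)                   # the user asks to return to the turn-1 setup
restored = dict(state.vars)
combined = state.compose(1, 2)      # turn-1 filter together with turn-2 threshold
```

In this toy setting, an agent that answers a post-rollback question with `threshold=0.8` has made the kind of state-tracking mistake that the abstract groups under long-horizon errors.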