DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
December 3, 2025
Authors: Fangyu Lei, Jinxiang Meng, Yiming Huang, Junjie Zhao, Yitong Zhang, Jianwen Luo, Xin Zou, Ruiyi Yang, Wenbo Shi, Yan Gao, Shizhu He, Zuo Wang, Qian Liu, Yang Wang, Ke Wang, Jun Zhao, Kang Liu
cs.AI
Abstract
Real-world enterprise data intelligence workflows encompass data engineering, which turns raw sources into analysis-ready tables, and data analysis, which converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirror these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and modifying existing systems as requirements evolve. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation; open-ended tasks are assessed by a reliable, experimentally validated LLM judge guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration rather than mere code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed for driving the development of truly capable autonomous data agents in enterprise settings. Our data and code are available at https://da-comp.github.io.
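The abstract states that open-ended DA tasks are graded by an LLM judge guided by hierarchical rubrics. As a rough, hypothetical illustration of that idea (this sketch is not taken from the paper; the dimension names, weights, and the `llm_score` hook are placeholders), a weighted, rubric-guided judge could be organized like this minimal Python sketch:

```python
# Illustrative sketch only (not the DAComp implementation): weighted,
# rubric-guided scoring of an open-ended data-analysis report by an LLM judge.
from typing import Callable, Dict

# A hierarchical rubric: each top-level dimension carries a weight and a list
# of criteria the judge scores individually (names and weights are hypothetical).
RUBRIC: Dict[str, Dict] = {
    "planning":        {"weight": 0.25, "criteria": ["problem decomposition", "method choice"]},
    "analysis":        {"weight": 0.40, "criteria": ["correct use of data", "interpretation of results"]},
    "recommendations": {"weight": 0.35, "criteria": ["actionability", "grounding in evidence"]},
}

def judge_report(report: str, task: str,
                 llm_score: Callable[[str], float]) -> float:
    """Return a score in [0, 1] as a weighted average over rubric dimensions.

    `llm_score` is any callable mapping a judging prompt to a score in [0, 1],
    e.g. a thin wrapper around an LLM API; it is deliberately left abstract here.
    """
    total = 0.0
    for dim, spec in RUBRIC.items():
        prompts = [
            f"Task: {task}\nReport: {report}\n"
            f"Rate the report on '{criterion}' (dimension: {dim}) from 0 to 1."
            for criterion in spec["criteria"]
        ]
        # Average the criterion-level scores, then weight by the dimension.
        dim_score = sum(llm_score(p) for p in prompts) / len(prompts)
        total += spec["weight"] * dim_score
    return total
```

A caller would supply its own scorer, e.g. `judge_report(report_text, task_text, llm_score=my_llm_scorer)`; the hierarchical structure simply decomposes one open-ended judgment into smaller, independently scored criteria before aggregating.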