AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Abstract

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent advances in large language models (LLMs) and artificial intelligence (AI) agents have substantially advanced the automation of data science workflows. However, it remains unclear to what extent AI agents can match human experts on domain-specific data science tasks, and in which respects human expertise still provides an advantage. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We ran an open competition with 29 teams and 80 participants, which enabled a systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning: AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while pointing to directions for the next generation of AI. The AgentDS website is at https://agentds.org/ and the open-source datasets are at https://huggingface.co/datasets/lainmn/AgentDS .
