

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

October 9, 2024
Authors: Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu
cs.AI

Abstract

We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements. First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, the examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must use complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable, executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously designed the evaluation suite to ensure the accuracy and robustness of the evaluation. We also develop the DA-Agent baseline. Experiments show that although this baseline performs better than other existing frameworks, it achieves only 30.5% accuracy even with the current best LLMs, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.
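To give a concrete flavor of the "data wrangling" step the abstract describes, the sketch below shows the kind of code an agent might write inside the executable environment: load messy tabular records, drop incomplete rows, and aggregate a derived quantity. The task, schema, and numbers are invented for illustration and are not taken from the DA-Code benchmark itself.

```python
import csv
import io
from collections import defaultdict

# Hypothetical wrangling task: clean sales records with missing fields,
# then aggregate revenue (units * unit_price) per region.
RAW = """region,units,unit_price
North, 10, 2.50
South,5,3.00
North,,2.50
South,8,
"""

def revenue_by_region(raw: str) -> dict:
    """Sum units * unit_price per region, skipping incomplete rows."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(raw)):
        units = row["units"].strip()
        price = row["unit_price"].strip()
        if not units or not price:  # drop records with missing values
            continue
        totals[row["region"].strip()] += int(units) * float(price)
    return dict(totals)

print(revenue_by_region(RAW))  # {'North': 25.0, 'South': 15.0}
```

Real DA-Code tasks are harder than this toy example: the agent must first ground itself in unfamiliar files, plan a multi-step analysis, and execute it end to end, which is where current LLMs reach only 30.5% accuracy.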

