

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

October 9, 2024
Authors: Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu
cs.AI

Abstract

We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements. First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills together with grounding and planning abilities. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We also develop the DA-Agent baseline. Experiments show that although the baseline outperforms other existing frameworks, it achieves only 30.5% accuracy even with the current best LLMs, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.
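To give a concrete sense of the multi-step data wrangling and analytics work the abstract describes, the sketch below is a minimal, purely illustrative Python example. It is not drawn from DA-Code itself: the CSV file, column names, and cleaning rules are hypothetical, and it only shows the general flavor of wrangling-plus-analysis tasks the benchmark targets.

```python
# Illustrative only -- not an actual DA-Code task. The file name,
# columns, and cleaning steps below are hypothetical.
import pandas as pd

# Load raw sales data (hypothetical file).
df = pd.read_csv("sales_raw.csv")

# Wrangling: normalize column names, drop duplicates, parse dates,
# and discard rows missing the fields needed for analysis.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.drop_duplicates()
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df.dropna(subset=["order_date", "revenue"])

# Analysis: monthly revenue per region, then the top region overall.
monthly = (
    df.groupby([df["order_date"].dt.to_period("M"), "region"])["revenue"]
    .sum()
    .reset_index(name="monthly_revenue")
)
top_region = monthly.groupby("region")["monthly_revenue"].sum().idxmax()

print(monthly.head())
print("Top region by total revenue:", top_region)
```

Even this toy version chains cleaning, type coercion, grouping, and aggregation; DA-Code tasks compound such steps over real, messy data, which is why they stress an agent's planning as well as its coding.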
