DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

Abstract

We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. The benchmark has three core features. First, the tasks in DA-Code are inherently challenging, which sets them apart from traditional code generation tasks and demands advanced coding skills in grounding and planning. Second, all examples in DA-Code are based on real and diverse data and cover a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, models must use complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable, executable, and scalable environment that aligns with real-world data analysis scenarios. The annotators meticulously designed the evaluation suite to ensure the accuracy and robustness of the evaluation. We also develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, it achieves only 30.5% accuracy even with the current best LLMs, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.
