DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

Abstract

We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. The benchmark has three core features. First, the tasks in DA-Code are inherently challenging: unlike traditional code generation tasks, they demand advanced coding skills in grounding and planning. Second, all examples in DA-Code are based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, models must use complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously designed the evaluation suite to ensure the accuracy and robustness of the evaluation. We also develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, it achieves only 30.5% accuracy even with the current best LLMs, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.
