BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Abstract

Automated software engineering has been greatly empowered by recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, most of these evaluations are limited to short, self-contained algorithmic tasks. Solving challenging and practical programming tasks requires the ability to use diverse function calls as tools to efficiently implement functionality such as data analysis and web development. In addition, using multiple tools to solve a task requires compositional reasoning, which in turn depends on accurately understanding complex instructions. Meeting both requirements at once can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical programming tasks, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains across 1,140 fine-grained programming tasks. To evaluate LLMs rigorously, each programming task comes with 5.6 test cases on average, with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores of up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.
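To make the task format described above more concrete, the following is a minimal sketch of what a task of this kind might look like: a function specified by a docstring that has to combine calls from several libraries, checked by unit tests that exercise both of its branches. It is an illustration only and is not taken from the benchmark; the function name `summarize_scores`, the CSV columns, and the test class are all hypothetical, and only the Python standard library is used.

```python
import csv
import io
import statistics
import unittest
from collections import Counter


def summarize_scores(csv_text):
    """Summarize CSV text that has 'name' and 'score' columns.

    Returns a dict with the mean score, the most frequent name,
    and the number of data rows. Raises ValueError if there are no data rows.
    (Hypothetical task specification, written for illustration.)
    """
    # csv + io: parse the in-memory text into a list of row dicts.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no data rows")
    # statistics: aggregate the numeric column.
    scores = [float(row["score"]) for row in rows]
    # collections.Counter: find the most frequent name.
    names = Counter(row["name"] for row in rows)
    return {
        "mean": statistics.mean(scores),
        "top_name": names.most_common(1)[0][0],
        "count": len(rows),
    }


class TestSummarizeScores(unittest.TestCase):
    """Two test cases that together cover both branches of the function."""

    def test_basic_summary(self):
        result = summarize_scores("name,score\nann,1\nbob,3\nann,5\n")
        self.assertEqual(result, {"mean": 3.0, "top_name": "ann", "count": 3})

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            summarize_scores("name,score\n")
```

Saved to a file, the tests can be run with `python -m unittest <file>`. An instruction-style variant of the same hypothetical task would replace the structured docstring with a short natural-language request carrying only the essential information.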
