BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

June 22, 2024
Authors: Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro von Werra
cs.AI

Abstract

Automated software engineering has been greatly empowered by the recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks. Solving challenging and practical programming tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task requires compositional reasoning and accurate understanding of complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical programming tasks, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains across 1,140 fine-grained programming tasks. To evaluate LLMs rigorously, each programming task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores of at most 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.
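
To make the task format concrete, below is a minimal, hypothetical sketch of a BigCodeBench-style task; the function name, docstring, and tests are illustrative assumptions rather than material from the benchmark itself. The pattern it shows is the one the abstract describes: a single Python function whose docstring asks for behavior that composes library function calls, paired with unittest-style tests that exercise its branches.

import unittest
import pandas as pd

def task_func(data, threshold):
    """Build a DataFrame from `data` (a dict of column name -> list),
    keep only rows where the "score" column exceeds `threshold`, and
    return the mean of the remaining scores, or None if no rows qualify.
    (Hypothetical task, for illustration only.)"""
    df = pd.DataFrame(data)
    filtered = df[df["score"] > threshold]
    if filtered.empty:  # edge-case branch exercised by the second test below
        return None
    return float(filtered["score"].mean())

class TestTaskFunc(unittest.TestCase):
    def test_mean_above_threshold(self):
        # Rows 5.0 and 9.0 survive the filter; their mean is 7.0.
        self.assertAlmostEqual(task_func({"score": [1.0, 5.0, 9.0]}, 4.0), 7.0)

    def test_no_rows_qualify(self):
        # No row exceeds the threshold, so the empty branch returns None.
        self.assertIsNone(task_func({"score": [1.0, 2.0]}, 10.0))

if __name__ == "__main__":
    unittest.main()

Grading in this style is execution-based: a completion passes a task only if every test succeeds, and the near-complete branch coverage reported in the paper reflects test suites designed to reach essentially every control-flow path, including edge cases like the empty-result branch above.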

