

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

June 22, 2024
Authors: Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra
cs.AI

Abstract

Automated software engineering has been greatly empowered by recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks. Solving challenging and practical programming tasks requires the capability to utilize diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task requires compositional reasoning and an accurate understanding of complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical programming tasks, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries across 7 domains for 1,140 fine-grained programming tasks. To evaluate LLMs rigorously, each programming task encompasses an average of 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, which automatically transforms the original docstrings into short instructions containing only the essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores of at most 60%, significantly lower than the human performance of 97%. These results underscore the need for further advancements in this area.
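To make the task format concrete, here is a minimal, hypothetical task in the style the abstract describes; it is not taken from BigCodeBench itself, and the names (task_func, TestTaskFunc) and the condensed instruction in the comment are illustrative assumptions. Solving the docstring's instruction requires composing function calls from two libraries (pandas and matplotlib), and the accompanying test checks the behavior programmatically, mirroring the benchmark's test-based evaluation.

```python
# A hypothetical, BigCodeBench-style task (illustrative only, not from the
# benchmark): the docstring is a "complex instruction" whose solution must
# compose calls from two libraries, pandas and matplotlib.
import unittest

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def task_func(scores: dict) -> plt.Axes:
    """Build a DataFrame from a mapping of student name to a list of
    scores, compute each student's mean score, and plot the means as a
    bar chart titled 'Average Scores'. Return the matplotlib Axes.
    """
    # An Instruct-style condensed instruction might read:
    # "Plot each student's average score as a bar chart titled 'Average Scores'."
    df = pd.DataFrame(scores)
    return df.mean().plot(kind="bar", title="Average Scores")


class TestTaskFunc(unittest.TestCase):
    # Mimics the benchmark's programmatic checking of task behavior.
    def test_title_and_bar_count(self):
        ax = task_func({"ada": [90, 95], "bob": [70, 80]})
        self.assertEqual(ax.get_title(), "Average Scores")
        self.assertEqual(len(ax.patches), 2)  # one bar per student


if __name__ == "__main__":
    unittest.main()
```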

