BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?
March 19, 2025
Authors: Pierre Chambon, Baptiste Roziere, Benoit Sagot, Gabriel Synnaeve
cs.AI
Abstract
We introduce BigO(Bench), a novel coding benchmark designed to evaluate the
capabilities of generative language models in understanding and generating code
with specified time and space complexities. This benchmark addresses the gap in
current evaluations that often overlook the ability of models to comprehend and
produce code constrained by computational complexity. BigO(Bench) includes
tooling to infer the algorithmic complexity of any Python function from
profiling measurements, including human- or LLM-generated solutions.
BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250
solutions from Code Contests annotated with inferred (synthetic) time and space
complexity labels from the complexity framework, as well as corresponding
runtime and memory footprint values for a large set of input sizes. We present
results from evaluating multiple state-of-the-art language models on this
benchmark, highlighting their strengths and weaknesses in handling complexity
requirements. In particular, token-space reasoning models are unrivaled in code
generation but not in complexity understanding, hinting that they may not
generalize well to tasks for which no reward was given at training time.
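To make the profiling-based inference concrete, below is a minimal, self-contained sketch of how one might estimate a Python function's time complexity from runtime measurements: time the function over a range of input sizes, then pick the candidate complexity class whose curve best fits the observed timings. The candidate set, the `profile` and `infer_time_complexity` helpers, and the least-squares fit are illustrative assumptions, not BigO(Bench)'s actual complexity framework.

```python
import math
import random
import time

# Illustrative candidate complexity classes to fit against. This is a
# simplified assumption about what a profiling-based inference could look
# like, not the BigO(Bench) framework itself.
CANDIDATES = {
    "O(1)":       lambda n: 1.0,
    "O(log n)":   lambda n: math.log(n),
    "O(n)":       lambda n: float(n),
    "O(n log n)": lambda n: n * math.log(n),
    "O(n^2)":     lambda n: float(n) ** 2,
}

def profile(func, make_input, sizes):
    """Measure wall-clock runtime of func on inputs of increasing size."""
    timings = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return timings

def infer_time_complexity(sizes, timings):
    """Return the candidate class whose curve best fits the timings.

    For each candidate f, fit t ~ c * f(n) by least squares and keep the
    class with the smallest normalized residual error.
    """
    best_label, best_err = None, float("inf")
    total = sum(t * t for t in timings)
    for label, f in CANDIDATES.items():
        feats = [f(n) for n in sizes]
        c = sum(t * x for t, x in zip(timings, feats)) / sum(x * x for x in feats)
        err = sum((t - c * x) ** 2 for t, x in zip(timings, feats)) / total
        if err < best_err:
            best_label, best_err = label, err
    return best_label

if __name__ == "__main__":
    sizes = [2 ** k for k in range(10, 18)]
    timings = profile(sorted, lambda n: random.sample(range(n), n), sizes)
    # Sorting random data should come out as O(n log n), though on noisy
    # timings O(n) and O(n log n) can be hard to tell apart at small sizes.
    print(infer_time_complexity(sizes, timings))
```

Space complexity could be fit the same way by recording peak memory (e.g. via the standard-library tracemalloc module) instead of wall-clock time, which matches the abstract's mention of both runtime and memory footprint values collected across a large set of input sizes.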