BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?
March 19, 2025
Authors: Pierre Chambon, Baptiste Roziere, Benoit Sagot, Gabriel Synnaeve
cs.AI
Abstract
We introduce BigO(Bench), a novel coding benchmark designed to evaluate the
capabilities of generative language models in understanding and generating code
with specified time and space complexities. This benchmark addresses the gap in
current evaluations that often overlook the ability of models to comprehend and
produce code constrained by computational complexity. BigO(Bench) includes
tooling to infer the algorithmic complexity of any Python function from
profiling measurements, including human- or LLM-generated solutions.
BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250
solutions from Code Contests annotated with inferred (synthetic) time and space
complexity labels from the complexity framework, as well as corresponding
runtime and memory footprint values for a large set of input sizes. We present
results from evaluating multiple state-of-the-art language models on this
benchmark, highlighting their strengths and weaknesses in handling complexity
requirements. In particular, token-space reasoning models are unrivaled in code
generation but not in complexity understanding, hinting that they may not
generalize well to tasks for which no reward was given at training time.
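To make the profiling-based inference concrete, below is a minimal, self-contained sketch of how one might estimate a Python function's time complexity from runtime measurements: time the function over a range of input sizes, then pick the candidate complexity class whose curve best fits the observed timings. The candidate set, the `profile` and `infer_time_complexity` helpers, and the least-squares fit are illustrative assumptions, not BigO(Bench)'s actual complexity framework.

```python
import math
import random
import time

# Illustrative candidate complexity classes to fit against. This is a
# simplified assumption about what a profiling-based inference could look
# like, not the BigO(Bench) framework itself.
CANDIDATES = {
    "O(1)":       lambda n: 1.0,
    "O(log n)":   lambda n: math.log(n),
    "O(n)":       lambda n: float(n),
    "O(n log n)": lambda n: n * math.log(n),
    "O(n^2)":     lambda n: float(n) ** 2,
}

def profile(func, make_input, sizes):
    """Measure wall-clock runtime of func on inputs of increasing size."""
    timings = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return timings

def infer_time_complexity(sizes, timings):
    """Return the candidate class whose curve best fits the timings.

    For each candidate f, fit t ~ c * f(n) by least squares and keep the
    class with the smallest normalized residual error.
    """
    best_label, best_err = None, float("inf")
    total = sum(t * t for t in timings)
    for label, f in CANDIDATES.items():
        feats = [f(n) for n in sizes]
        c = sum(t * x for t, x in zip(timings, feats)) / sum(x * x for x in feats)
        err = sum((t - c * x) ** 2 for t, x in zip(timings, feats)) / total
        if err < best_err:
            best_label, best_err = label, err
    return best_label

if __name__ == "__main__":
    sizes = [2 ** k for k in range(10, 18)]
    timings = profile(sorted, lambda n: random.sample(range(n), n), sizes)
    # Sorting random data should come out as O(n log n), though on noisy
    # timings O(n) and O(n log n) can be hard to tell apart at small sizes.
    print(infer_time_complexity(sizes, timings))
```

Space complexity could be fit the same way by recording peak memory (e.g. via the standard-library tracemalloc module) instead of wall-clock time, which matches the abstract's mention of both runtime and memory footprint values collected across a large set of input sizes.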