BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?
March 19, 2025
Authors: Pierre Chambon, Baptiste Roziere, Benoit Sagot, Gabriel Synnaeve
cs.AI
Abstract
We introduce BigO(Bench), a novel coding benchmark designed to evaluate the
capabilities of generative language models in understanding and generating code
with specified time and space complexities. This benchmark addresses the gap in
current evaluations that often overlook the ability of models to comprehend and
produce code constrained by computational complexity. BigO(Bench) includes
tooling to infer the algorithmic complexity of any Python function from
profiling measurements, including human- or LLM-generated solutions.
BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250
solutions from Code Contests annotated with inferred (synthetic) time and space
complexity labels from the complexity framework, as well as corresponding
runtime and memory footprint values for a large set of input sizes. We present
results from evaluating multiple state-of-the-art language models on this
benchmark, highlighting their strengths and weaknesses in handling complexity
requirements. In particular, token-space reasoning models are unrivaled in code
generation but not in complexity understanding, hinting that they may not
generalize well to tasks for which no reward was given at training time.
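The abstract's key technical component is tooling that infers a function's complexity class from profiling measurements. The sketch below illustrates one simple way such inference can work: time a function at increasing input sizes, then least-squares fit a small set of candidate complexity curves to the observed runtimes and keep the best-fitting label. This is a hedged toy under stated assumptions, not BigO(Bench)'s actual framework; all names in it (`CANDIDATE_CURVES`, `profile_runtimes`, `infer_complexity`) are hypothetical.

```python
# Toy complexity inferrer: an illustration of the general idea only,
# NOT the BigO(Bench) framework. All names here are hypothetical.
import math
import time

import numpy as np

# Candidate complexity classes, each expressed as a curve over input size n.
CANDIDATE_CURVES = {
    "O(1)": lambda n: np.ones_like(n, dtype=float),
    "O(log n)": lambda n: np.log2(n),
    "O(n)": lambda n: n.astype(float),
    "O(n log n)": lambda n: n * np.log2(n),
    "O(n^2)": lambda n: n.astype(float) ** 2,
}


def profile_runtimes(func, sizes, make_input):
    """Measure wall-clock runtime of `func` on one input of each size."""
    runtimes = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        func(data)
        runtimes.append(time.perf_counter() - start)
    return np.array(runtimes)


def infer_complexity(sizes, runtimes):
    """Return the candidate label whose curve best fits the runtimes.

    Each candidate is fit as runtime ~ a * curve(n) by least squares;
    the candidate with the smallest squared residual wins. Residuals are
    comparable across candidates because they all fit the same runtimes.
    """
    n = np.asarray(sizes)
    best_label, best_err = None, math.inf
    for label, curve in CANDIDATE_CURVES.items():
        basis = curve(n).reshape(-1, 1)
        _, residual, *_ = np.linalg.lstsq(basis, runtimes, rcond=None)
        err = residual[0] if residual.size else math.inf
        if err < best_err:
            best_label, best_err = label, err
    return best_label


if __name__ == "__main__":
    sizes = [2 ** k for k in range(12, 21)]  # 4,096 .. ~1M elements
    times = profile_runtimes(
        sorted, sizes, lambda n: list(np.random.rand(n))
    )
    # Timsort is O(n log n); timing noise may occasionally favor O(n).
    print(infer_complexity(sizes, times))
```

Since the abstract notes that the benchmark also records memory footprints over a large set of input sizes, an analogous fit over peak memory instead of runtime could, in principle, approximate space-complexity inference; the paper's framework presumably handles both more robustly than this single-run sketch.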