BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?
March 19, 2025
Authors: Pierre Chambon, Baptiste Roziere, Benoit Sagot, Gabriel Synnaeve
cs.AI
Abstract
We introduce BigO(Bench), a novel coding benchmark designed to evaluate the
capabilities of generative language models in understanding and generating code
with specified time and space complexities. This benchmark addresses the gap in
current evaluations that often overlook the ability of models to comprehend and
produce code constrained by computational complexity. BigO(Bench) includes
tooling to infer the algorithmic complexity of any Python function from
profiling measurements, including human- or LLM-generated solutions.
BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250
solutions from Code Contests annotated with inferred (synthetic) time and space
complexity labels from the complexity framework, as well as corresponding
runtime and memory footprint values for a large set of input sizes. We present
results from evaluating multiple state-of-the-art language models on this
benchmark, highlighting their strengths and weaknesses in handling complexity
requirements. In particular, token-space reasoning models are unrivaled in code
generation but not in complexity understanding, hinting that they may not
generalize well to tasks for which no reward was given at training time.
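The abstract's key technical component is tooling that infers a function's complexity class from profiling measurements. The sketch below illustrates one simple way such inference can work: time a function at increasing input sizes, then least-squares fit a small set of candidate complexity curves to the observed runtimes and keep the best-fitting label. This is a hedged toy under stated assumptions, not BigO(Bench)'s actual framework; all names in it (`CANDIDATE_CURVES`, `profile_runtimes`, `infer_complexity`) are hypothetical.

```python
# Toy complexity inferrer: an illustration of the general idea only,
# NOT the BigO(Bench) framework. All names here are hypothetical.
import math
import time

import numpy as np

# Candidate complexity classes, each expressed as a curve over input size n.
CANDIDATE_CURVES = {
    "O(1)": lambda n: np.ones_like(n, dtype=float),
    "O(log n)": lambda n: np.log2(n),
    "O(n)": lambda n: n.astype(float),
    "O(n log n)": lambda n: n * np.log2(n),
    "O(n^2)": lambda n: n.astype(float) ** 2,
}


def profile_runtimes(func, sizes, make_input):
    """Measure wall-clock runtime of `func` on one input of each size."""
    runtimes = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        func(data)
        runtimes.append(time.perf_counter() - start)
    return np.array(runtimes)


def infer_complexity(sizes, runtimes):
    """Return the candidate label whose curve best fits the runtimes.

    Each candidate is fit as runtime ~ a * curve(n) by least squares;
    the candidate with the smallest squared residual wins. Residuals are
    comparable across candidates because they all fit the same runtimes.
    """
    n = np.asarray(sizes)
    best_label, best_err = None, math.inf
    for label, curve in CANDIDATE_CURVES.items():
        basis = curve(n).reshape(-1, 1)
        _, residual, *_ = np.linalg.lstsq(basis, runtimes, rcond=None)
        err = residual[0] if residual.size else math.inf
        if err < best_err:
            best_label, best_err = label, err
    return best_label


if __name__ == "__main__":
    sizes = [2 ** k for k in range(12, 21)]  # 4,096 .. ~1M elements
    times = profile_runtimes(
        sorted, sizes, lambda n: list(np.random.rand(n))
    )
    # Timsort is O(n log n); timing noise may occasionally favor O(n).
    print(infer_complexity(sizes, times))
```

Since the abstract notes that the benchmark also records memory footprints over a large set of input sizes, an analogous fit over peak memory instead of runtime could, in principle, approximate space-complexity inference; the paper's framework presumably handles both more robustly than this single-run sketch.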