BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

Abstract

We introduce BigO(Bench), a novel coding benchmark designed to evaluate how well generative language models understand and generate code with specified time and space complexities. The benchmark addresses a gap in current evaluations, which often overlook a model's ability to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests, annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as the corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.
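To make the idea of inferring complexity from profiling measurements concrete, the following is a minimal sketch and not the BigO(Bench) framework itself. It assumes a simple approach: runtimes measured at several input sizes are fitted by least squares against a small set of candidate growth functions, and the best-fitting class is reported. The names `CANDIDATES`, `fit_residual`, `infer_complexity`, and `measure_runtime` are hypothetical and introduced only for illustration.

```python
import math
import time

# Hypothetical candidate classes (not taken from BigO(Bench)): name -> growth g(n).
# Input sizes are assumed to be positive integers.
CANDIDATES = {
    "O(1)": lambda n: 1.0,
    "O(log n)": lambda n: math.log2(n),
    "O(n)": lambda n: float(n),
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)": lambda n: float(n) ** 2,
}


def fit_residual(sizes, measurements, growth):
    """Fit measurements ~ a * growth(n) + b by least squares; return the squared error."""
    xs = [growth(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(measurements) / len(measurements)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, measurements))
    # A constant growth function has zero variance; fall back to a flat fit.
    a = cov_xy / var_x if var_x > 0 else 0.0
    b = mean_y - a * mean_x
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, measurements))


def infer_complexity(sizes, measurements):
    """Return the candidate class with the smallest fitting error.

    Ties resolve to the earliest (slowest-growing) entry in CANDIDATES.
    """
    return min(
        CANDIDATES,
        key=lambda name: fit_residual(sizes, measurements, CANDIDATES[name]),
    )


def measure_runtime(func, make_input, sizes, repeats=5):
    """Time func on inputs of each size, keeping the best of several runs."""
    timings = []
    for n in sizes:
        data = make_input(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            best = min(best, time.perf_counter() - start)
        timings.append(best)
    return timings
```

For example, `infer_complexity(sizes, measure_runtime(sorted, lambda n: list(range(n, 0, -1)), sizes))` would classify the built-in `sorted` on reversed lists. Real timings are noisy, so a practical tool would likely need more robust fitting and repeated measurements than this sketch provides.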

