CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Abstract

We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark, which can be used to create future variations of the benchmark. Second, we evaluate twenty code models on our benchmark and find that many recent models that score highly on HumanEval do not show the same improvements on our benchmark. Third, we show that simple chain-of-thought (CoT) and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with CoT, achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, respectively, highlighting the gap between open and closed source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement.
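To make the two tasks concrete, here is a minimal sketch of how a single benchmark item might be checked. The function `f`, its input-output pair, and the helper names are hypothetical illustrations, not taken from the CRUXEval dataset or its evaluation code. The sketch assumes that input prediction accepts any input that reproduces the given output, and that one sample is drawn per problem, so pass@1 reduces to the fraction of problems answered correctly.

```python
# Hypothetical toy item in the spirit of the benchmark: a short Python
# function plus one input-output pair. Not an actual CRUXEval sample.
def f(s):
    # Keep only alphabetic characters and upper-case them.
    return "".join(ch.upper() for ch in s if ch.isalpha())


GIVEN_INPUT = "a1b2c"
GIVEN_OUTPUT = "ABC"


def check_output_prediction(predicted_output):
    """Output prediction: the model sees f and the input, and predicts the output."""
    return f(GIVEN_INPUT) == predicted_output


def check_input_prediction(predicted_input):
    """Input prediction: the model sees f and the output, and proposes an input.

    Assumption: any input that makes f return the given output counts as
    correct, not only the original input.
    """
    return f(predicted_input) == GIVEN_OUTPUT


def pass_at_1(results):
    """Fraction of problems solved, assuming one sample per problem."""
    return sum(results) / len(results)
```

For example, `check_input_prediction("a-b-c!")` succeeds even though the proposed input differs from `GIVEN_INPUT`, because executing `f` on it still yields the given output.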
