
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

January 5, 2024
Authors: Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang
cs.AI

Abstract

We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variations of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open and closed source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement.
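To make the two tasks concrete, below is a minimal, hypothetical CRUXEval-style item (illustrative only, not drawn from the benchmark): a short Python function paired with a single input-output example, from which the two tasks are derived.

```python
# Hypothetical CRUXEval-style item: a short Python function plus one
# input-output pair. The function below is an illustrative example,
# not an actual benchmark problem.
def f(s):
    # Reverse each word while keeping word order.
    return " ".join(w[::-1] for w in s.split())

# Output prediction: given f and the input, predict what f returns.
assert f("hello world") == "olleh dlrow"

# Input prediction: given f and the output, supply any input that produces it.
assert f("cba fed") == "abc def"
```

A model's answer can be checked simply by executing the corresponding assertion, and pass@1 then measures the fraction of problems solved with a single sampled prediction, as reported in the abstract.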