CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
January 5, 2024
Authors: Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang
cs.AI
Abstract
We present CRUXEval (Code Reasoning, Understanding, and eXecution
Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each
function comes with an input-output pair, leading to two natural tasks: input
prediction and output prediction. First, we propose a generic recipe for
generating our execution benchmark which can be used to create future variations
of the benchmark. Second, we evaluate twenty code models on our benchmark and
discover that many recent high-scoring models on HumanEval do not show the same
improvements on our benchmark. Third, we show that simple CoT and fine-tuning
schemes can improve performance on our benchmark but remain far from solving
it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75%
and 81% on input and output prediction, respectively. In contrast, Code Llama
34B achieves a pass@1 of 50% and 46% on input and output prediction,
highlighting the gap between open and closed source models. As no model is
close to acing CRUXEval, we provide examples of consistent GPT-4 failures on
simple programs as a lens into its code reasoning capabilities and areas for
improvement.
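
To make the two task formats concrete, below is a minimal sketch on a hypothetical short Python function (not drawn from the benchmark itself, and the exact prompt format used by CRUXEval may differ): output prediction asks the model to run the code forward on a given input, while input prediction asks for any input consistent with a given output.

# A minimal sketch of the two CRUXEval-style tasks on a hypothetical
# short Python function; the benchmark's actual prompt format may differ.

def f(s):
    # Keep only the first character of each run of repeated characters.
    return "".join(c for i, c in enumerate(s) if i == 0 or c != s[i - 1])

# Output prediction: given f and an input, fill in the output.
#   assert f("aabbccd") == ??    # expected answer: "abcd"
# Input prediction: given f and an output, fill in any consistent input.
#   assert f(??) == "abcd"       # e.g. "aabbccd" (or "abcd" itself)

# Sanity check that the illustrated input-output pair is consistent.
assert f("aabbccd") == "abcd"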