CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
January 5, 2024
Authors: Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang
cs.AI
Abstract
We present CRUXEval (Code Reasoning, Understanding, and eXecution
Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each
function comes with an input-output pair, leading to two natural tasks: input
prediction and output prediction. First, we propose a generic recipe for
generating our execution benchmark which can be used to create future variations
of the benchmark. Second, we evaluate twenty code models on our benchmark and
discover that many recent high-scoring models on HumanEval do not show the same
improvements on our benchmark. Third, we show that simple CoT and fine-tuning
schemes can improve performance on our benchmark but remain far from solving
it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75%
and 81% on input and output prediction, respectively. In contrast, Code Llama
34B achieves a pass@1 of 50% and 46% on input and output prediction,
highlighting the gap between open and closed source models. As no model is
close to acing CRUXEval, we provide examples of consistent GPT-4 failures on
simple programs as a lens into its code reasoning capabilities and areas for
improvement.
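
To make the two task formats concrete, below is a minimal sketch on a hypothetical short Python function (not drawn from the benchmark itself, and the exact prompt format used by CRUXEval may differ): output prediction asks the model to run the code forward on a given input, while input prediction asks for any input consistent with a given output.

# A minimal sketch of the two CRUXEval-style tasks on a hypothetical
# short Python function; the benchmark's actual prompt format may differ.

def f(s):
    # Keep only the first character of each run of repeated characters.
    return "".join(c for i, c in enumerate(s) if i == 0 or c != s[i - 1])

# Output prediction: given f and an input, fill in the output.
#   assert f("aabbccd") == ??    # expected answer: "abcd"
# Input prediction: given f and an output, fill in any consistent input.
#   assert f(??) == "abcd"       # e.g. "aabbccd" (or "abcd" itself)

# Sanity check that the illustrated input-output pair is consistent.
assert f("aabbccd") == "abcd"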