OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

Abstract

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) averaged over HumanEval and MBPP (and their plus versions), closely rivaling GPT-4's 84.2 (76.2), and rises further to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.
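To make the generate, execute, and refine cycle concrete, the following is a minimal sketch of such a loop. It is an illustration under stated assumptions, not the OpenCodeInterpreter implementation: `generate` is a hypothetical stand-in for a model call, `run_code` and `refine_loop` are names introduced here, and feedback is reduced to the raw execution output (human feedback could be appended to the same string).

```python
import subprocess
import sys
from typing import Callable, List, Tuple


def run_code(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run a Python snippet in a fresh interpreter.

    Returns (succeeded, combined stdout and stderr).
    Note: this is not a sandbox; untrusted code needs real isolation.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "execution timed out"
    return proc.returncode == 0, proc.stdout + proc.stderr


def refine_loop(
    task: str,
    generate: Callable[[str, str], str],  # hypothetical model call: (task, feedback) -> code
    max_turns: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Generate code, execute it, and feed failures back until it runs or turns run out."""
    history: List[Tuple[str, str]] = []
    feedback = ""
    code = ""
    for _ in range(max_turns):
        code = generate(task, feedback)
        ok, output = run_code(code)
        history.append((code, output))
        if ok:
            break
        # Assumption: the execution output alone serves as the next turn's feedback.
        feedback = output
    return code, history
```

In this sketch a run counts as successful when the interpreter exits with status 0. A real system would more likely check the output against test cases, as the benchmarks named above do.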