OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

Abstract

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset of 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter on key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus shows strong performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) averaged over HumanEval and MBPP (and their plus versions), closely rivaling GPT-4's 84.2 (76.2), and rises further to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter narrows the gap between open-source code generation models and proprietary systems like the GPT-4 Code Interpreter.
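The generate, execute, and refine cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `run_code`, `refine_loop`, and `stub_generate` are hypothetical names introduced here, and the stub stands in for a real model call. The sketch assumes that execution feedback is simply the error text of a failed run, passed back to the generator on the next turn.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple


def run_code(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run `code` in a fresh Python subprocess.

    Returns (succeeded, text), where text is stdout on success
    and stderr (or a timeout note) on failure.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "execution timed out"
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr


def refine_loop(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_turns: int = 3,
) -> Tuple[bool, str, str, int]:
    """Alternate generation and execution until the code runs or turns run out.

    `generate(task, feedback)` is a hypothetical interface: feedback is None on
    the first turn and the previous error text afterwards.
    Returns (succeeded, last_code, last_output, turns_used).
    """
    feedback: Optional[str] = None
    code, output = "", ""
    for turn in range(1, max_turns + 1):
        code = generate(task, feedback)
        ok, output = run_code(code)
        if ok:
            return True, code, output, turn
        feedback = output  # execution feedback drives the next attempt
    return False, code, output, max_turns


def stub_generate(task: str, feedback: Optional[str]) -> str:
    """Stand-in for a model: first attempt is buggy, the retry is fixed."""
    if feedback is None:
        return "print(undefined_name)"  # raises NameError when executed
    return "print(sum(range(1, 11)))"


if __name__ == "__main__":
    ok, code, output, turns = refine_loop(stub_generate, "sum the integers 1 to 10")
    print(ok, turns, output.strip())
```

In a real system, the stub would be replaced by a call to a code model, and the feedback string could also carry human comments alongside the execution errors. Running generated code in a plain subprocess is for illustration only; untrusted code would normally need a sandbox.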