OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

Abstract

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this gap, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. We evaluate OpenCodeInterpreter on key benchmarks, namely HumanEval, MBPP, and their enhanced versions from EvalPlus, and find that it performs strongly. Notably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 across HumanEval and MBPP (76.4 on their plus versions), closely rivaling GPT-4's 84.2 (76.2), and rises further to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter narrows the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.
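
The generate, execute, and refine cycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_code`, `refine`, and `toy_generate` are hypothetical, the "model" is a hard-coded stub standing in for a real code LLM, and execution feedback is simply the captured output of a Python subprocess.

```python
import subprocess
import sys
from typing import Callable, List, Tuple

# One past attempt: (code that was tried, execution output it produced).
Attempt = Tuple[str, str]


def run_code(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run `code` in a fresh Python subprocess; return (succeeded, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "execution timed out"
    return proc.returncode == 0, proc.stdout + proc.stderr


def refine(
    task: str,
    generate: Callable[[str, List[Attempt]], str],
    max_turns: int = 3,
) -> Tuple[str, List[Attempt]]:
    """Generate code, execute it, and feed failures back for another turn.

    Returns the last generated code and the history of failed attempts.
    """
    history: List[Attempt] = []
    code = ""
    for _ in range(max_turns):
        code = generate(task, history)
        succeeded, output = run_code(code)
        if succeeded:
            break
        # Execution feedback (e.g. a traceback) becomes context for the next turn.
        history.append((code, output))
    return code, history


def toy_generate(task: str, history: List[Attempt]) -> str:
    """Hypothetical stand-in for a model: buggy first, corrected after feedback."""
    if not history:
        return "print(1 / 0)"  # fails with ZeroDivisionError
    return "print(1 + 1)"
```

Calling `refine("add two numbers", toy_generate)` first runs the failing snippet, records its traceback in the history, and then returns the corrected snippet on the second turn. In a real system the stub would be replaced by a model call, and human feedback could be appended to the history alongside the execution output.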