OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

Abstract

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. We evaluate OpenCodeInterpreter comprehensively on key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus, and find strong performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) averaged over HumanEval and MBPP (plus versions in parentheses), closely rivaling GPT-4's 84.2 (76.2), and rises further to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter narrows the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.
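
The generate, execute, and refine cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_python`, `refine_loop`, and `toy_generate` are hypothetical, the "model" is a hard-coded stand-in, and human feedback is not modeled. It only shows how execution output from a failed attempt might be fed back into the next generation turn.

```python
import subprocess
import sys
from typing import Callable, List, Tuple


def run_python(code: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Run code in a fresh interpreter; return (succeeded, combined output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()


def refine_loop(
    task: str,
    generate: Callable[[str, List[str]], str],
    max_turns: int = 3,
) -> Tuple[str, bool, str, int]:
    """Generate code, execute it, and retry with the error output as feedback.

    Returns (last code, succeeded, last output, turns used).
    """
    feedback: List[str] = []
    code, ok, output, turns = "", False, "", 0
    for turns in range(1, max_turns + 1):
        code = generate(task, feedback)  # model call (hypothetical interface)
        ok, output = run_python(code)    # execution feedback
        if ok:
            break
        feedback.append(output)          # carry the error into the next turn
    return code, ok, output, turns


def toy_generate(task: str, feedback: List[str]) -> str:
    """Hypothetical stand-in for a model: buggy first, fixed after feedback."""
    if not feedback:
        return "print(undefined_name)"   # raises NameError when executed
    return "print(sum(range(5)))"


if __name__ == "__main__":
    code, ok, output, turns = refine_loop("sum 0..4", toy_generate)
    print(ok, output, turns)
```

In a real system `generate` would call a code model with the task and the accumulated feedback in its prompt, and execution would happen in a sandbox rather than a plain subprocess.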