InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

Abstract

Recent open-source code large language models (LLMs) have demonstrated remarkable coding abilities by fine-tuning on data generated by powerful closed-source LLMs such as GPT-3.5 and GPT-4. This paper explores how to further improve an instruction-tuned code LLM by generating data from the model itself rather than querying closed-source LLMs. Our key observation is the misalignment between translating formal and informal languages: translating formal language (i.e., code) into informal language (i.e., natural language) is more straightforward than the reverse. Based on this observation, we propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of generating code from instructions. Specifically, given an instruction-tuning corpus for code and the resulting instruction-tuned code LLM, we ask the code LLM to generate additional high-quality instructions for the original corpus through code summarization and self-evaluation. We then fine-tune the base LLM on the combination of the original corpus and the self-generated one, which yields a stronger instruction-tuned LLM. We present a series of code LLMs named InverseCoder, which surpass the original code LLMs on a wide range of benchmarks, including Python text-to-code generation, multilingual coding, and data-science code generation.
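The data-generation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `inverse_instruct`, the callables `summarize` and `self_evaluate`, and the choice to keep only the single best-scoring candidate per snippet are all assumptions made for this example. In practice both callables would wrap the instruction-tuned code LLM; here they are toy stand-ins so the sketch runs on its own. The final fine-tuning step on the combined corpus is omitted.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, code)


def inverse_instruct(
    corpus: List[Pair],
    summarize: Callable[[str], List[str]],
    self_evaluate: Callable[[str, str], float],
) -> List[Pair]:
    """Return the original corpus plus self-generated (instruction, code) pairs.

    summarize:     hypothetical wrapper that asks the code LLM for candidate
                   instructions describing a code snippet.
    self_evaluate: hypothetical wrapper that asks the same LLM to score how
                   well an instruction matches the code (higher is better).
    """
    generated: List[Pair] = []
    for _, code in corpus:
        candidates = summarize(code)
        if not candidates:
            continue  # nothing to select for this snippet
        # Assumption: keep only the candidate the model itself rates highest.
        best = max(candidates, key=lambda inst: self_evaluate(inst, code))
        generated.append((best, code))
    return corpus + generated


# Toy stand-ins for the LLM calls (assumptions, for illustration only).
def toy_summarize(code: str) -> List[str]:
    name = code.split("(")[0].split()[-1]  # function name after "def"
    return ["Do something.", f"Write a function named {name}."]


def toy_self_evaluate(instruction: str, code: str) -> float:
    return float(len(instruction))  # pretend longer means more specific


corpus = [("Add two numbers.", "def add(a, b):\n    return a + b")]
combined = inverse_instruct(corpus, toy_summarize, toy_self_evaluate)
```

The resulting `combined` list is what the base LLM would be fine-tuned on, according to the abstract; how candidates are sampled, scored, and filtered in the actual method may differ from this sketch.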
