InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct
July 8, 2024
Authors: Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen
cs.AI
Abstract
Recent advancements in open-source code large language models (LLMs) have
demonstrated remarkable coding abilities by fine-tuning on the data generated
from powerful closed-source LLMs such as GPT-3.5 and GPT-4 for instruction
tuning. This paper explores how to further improve an instruction-tuned code
LLM by generating data from itself rather than querying closed-source LLMs. Our
key observation is an asymmetry in translation between formal and
informal languages: translating formal language (i.e., code) into informal
language (i.e., natural language) is more straightforward than the reverse.
Based on this observation, we propose INVERSE-INSTRUCT, which summarizes
instructions from code snippets instead of the reverse. Specifically, given an
instruction tuning corpus for code and the resulting instruction-tuned code
LLM, we ask the code LLM to generate additional high-quality instructions for
the original corpus through code summarization and self-evaluation. Then, we
fine-tune the base LLM on the combination of the original corpus and the
self-generated one, which yields a stronger instruction-tuned LLM. We present a
series of code LLMs named InverseCoder, which surpasses the performance of the
original code LLMs on a wide range of benchmarks, including Python text-to-code
generation, multilingual coding, and data-science code generation.
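
The abstract describes the INVERSE-INSTRUCT data-generation loop only at a high level. Below is a minimal sketch of that loop in Python, assuming a generic `generate` interface on the code LLM; the prompt templates, candidate count, and `CodeLLM` abstraction are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the Inverse-Instruct loop from the abstract:
# summarize instructions from existing code, keep the best candidate via
# self-evaluation, then combine with the original corpus for fine-tuning.
# `code_llm.generate(prompt)` is an assumed interface, not the paper's code.

from dataclasses import dataclass


@dataclass
class Example:
    instruction: str
    code: str


def inverse_instruct(code_llm, corpus: list[Example],
                     n_candidates: int = 4) -> list[Example]:
    """Return a self-generated corpus of (instruction, code) pairs."""
    generated: list[Example] = []
    for ex in corpus:
        # Step 1: code summarization -- ask the instruction-tuned code LLM
        # for several candidate instructions that the snippet would satisfy.
        candidates = [
            code_llm.generate(
                "Write an instruction for which the following code "
                f"is a correct solution:\n{ex.code}"
            )
            for _ in range(n_candidates)
        ]
        # Step 2: self-evaluation -- ask the same model to select the
        # candidate instruction that best matches the code.
        best = code_llm.generate(
            "Choose the instruction that best describes this code.\n"
            f"Code:\n{ex.code}\nCandidates:\n"
            + "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
        )
        generated.append(Example(instruction=best, code=ex.code))
    return generated


# Fine-tuning the base LLM on `corpus + inverse_instruct(...)` yields the
# stronger instruction-tuned model (InverseCoder); training is not shown.
```

The direction of generation is the point of the design: producing natural-language instructions from code is the easier translation direction per the paper's key observation, so the model's own summaries can serve as high-quality training data without querying a closed-source LLM.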