InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct
July 8, 2024
作者: Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen
cs.AI
Abstract
Recent advancements in open-source code large language models (LLMs) have
demonstrated remarkable coding abilities by fine-tuning on the data generated
from powerful closed-source LLMs such as GPT-3.5 and GPT-4 for instruction
tuning. This paper explores how to further improve an instruction-tuned code
LLM by generating data from itself rather than querying closed-source LLMs. Our
key observation is the misalignment between the translation of formal and
informal languages: translating formal language (i.e., code) to informal
language (i.e., natural language) is more straightforward than the reverse.
Based on this observation, we propose INVERSE-INSTRUCT, which summarizes
instructions from code snippets instead of the reverse. Specifically, given an
instruction tuning corpus for code and the resulting instruction-tuned code
LLM, we ask the code LLM to generate additional high-quality instructions for
the original corpus through code summarization and self-evaluation. Then, we
fine-tune the base LLM on the combination of the original corpus and the
self-generated one, which yields a stronger instruction-tuned LLM. We present a
series of code LLMs named InverseCoder, which surpasses the performance of the
original code LLMs on a wide range of benchmarks, including Python text-to-code
generation, multilingual coding, and data-science code generation.
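The abstract describes a three-step data pipeline: summarize code snippets into new instructions, filter them by self-evaluation, then fine-tune on the original corpus combined with the self-generated one. A minimal sketch of that pipeline is below, assuming a stand-in `model` callable in place of the instruction-tuned code LLM; all function names and prompt wordings here are illustrative, not taken from the paper:

```python
# Sketch of the Inverse-Instruct data-generation loop described in the abstract.
# `model` is a placeholder for the instruction-tuned code LLM: any callable
# that maps a prompt string to a response string.

def summarize_code(model, code):
    """Step 1: ask the code LLM to summarize a snippet into an instruction."""
    return model(f"Summarize this code as a natural-language instruction:\n{code}")

def self_evaluate(model, instruction, code):
    """Step 2: ask the same LLM to judge whether the instruction fits the code."""
    verdict = model(
        f"Does this instruction describe the code? Answer yes or no.\n"
        f"Instruction: {instruction}\nCode:\n{code}"
    )
    return verdict.strip().lower().startswith("yes")

def inverse_instruct(model, corpus):
    """Step 3 (data side): build the self-generated corpus of new
    (instruction, code) pairs that pass self-evaluation, and return it
    merged with the original corpus for fine-tuning the base LLM."""
    extra = []
    for _, code in corpus:
        instruction = summarize_code(model, code)
        if self_evaluate(model, instruction, code):
            extra.append((instruction, code))
    return corpus + extra

# Toy stand-in model so the sketch runs end to end: answers "yes" to
# evaluation prompts and returns a canned summary otherwise.
def toy_model(prompt):
    return "yes" if prompt.startswith("Does") else "Reverse a list."

corpus = [("Write a function that reverses a list.",
           "def rev(xs): return xs[::-1]")]
augmented = inverse_instruct(toy_model, corpus)
print(len(augmented))  # original pair plus one self-generated pair
```

In the paper's setting the summarization and evaluation calls go to the same instruction-tuned code LLM, so the extra data costs no closed-source API queries; the toy model above only stands in for that loop.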