InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct
July 8, 2024
作者: Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen
cs.AI
Abstract
Recent advancements in open-source code large language models (LLMs) have
demonstrated remarkable coding abilities by fine-tuning on the data generated
from powerful closed-source LLMs such as GPT-3.5 and GPT-4 for instruction
tuning. This paper explores how to further improve an instruction-tuned code
LLM by generating data from itself rather than querying closed-source LLMs. Our
key observation is the misalignment between the translation of formal and
informal languages: translating formal language (i.e., code) to informal
language (i.e., natural language) is more straightforward than the reverse.
Based on this observation, we propose INVERSE-INSTRUCT, which summarizes
instructions from code snippets instead of the reverse. Specifically, given an
instruction tuning corpus for code and the resulting instruction-tuned code
LLM, we ask the code LLM to generate additional high-quality instructions for
the original corpus through code summarization and self-evaluation. Then, we
fine-tune the base LLM on the combination of the original corpus and the
self-generated one, which yields a stronger instruction-tuned LLM. We present a
series of code LLMs named InverseCoder, which surpasses the performance of the
original code LLMs on a wide range of benchmarks, including Python text-to-code
generation, multilingual coding, and data-science code generation.
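The abstract describes a three-step data pipeline: summarize code snippets into new instructions, filter them by self-evaluation, then fine-tune on the original corpus combined with the self-generated one. A minimal sketch of that pipeline is below, assuming a stand-in `model` callable in place of the instruction-tuned code LLM; all function names and prompt wordings here are illustrative, not taken from the paper:

```python
# Sketch of the Inverse-Instruct data-generation loop described in the abstract.
# `model` is a placeholder for the instruction-tuned code LLM: any callable
# that maps a prompt string to a response string.

def summarize_code(model, code):
    """Step 1: ask the code LLM to summarize a snippet into an instruction."""
    return model(f"Summarize this code as a natural-language instruction:\n{code}")

def self_evaluate(model, instruction, code):
    """Step 2: ask the same LLM to judge whether the instruction fits the code."""
    verdict = model(
        f"Does this instruction describe the code? Answer yes or no.\n"
        f"Instruction: {instruction}\nCode:\n{code}"
    )
    return verdict.strip().lower().startswith("yes")

def inverse_instruct(model, corpus):
    """Step 3 (data side): build the self-generated corpus of new
    (instruction, code) pairs that pass self-evaluation, and return it
    merged with the original corpus for fine-tuning the base LLM."""
    extra = []
    for _, code in corpus:
        instruction = summarize_code(model, code)
        if self_evaluate(model, instruction, code):
            extra.append((instruction, code))
    return corpus + extra

# Toy stand-in model so the sketch runs end to end: answers "yes" to
# evaluation prompts and returns a canned summary otherwise.
def toy_model(prompt):
    return "yes" if prompt.startswith("Does") else "Reverse a list."

corpus = [("Write a function that reverses a list.",
           "def rev(xs): return xs[::-1]")]
augmented = inverse_instruct(toy_model, corpus)
print(len(augmented))  # original pair plus one self-generated pair
```

In the paper's setting the summarization and evaluation calls go to the same instruction-tuned code LLM, so the extra data costs no closed-source API queries; the toy model above only stands in for that loop.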