Magicoder: Source Code Is All You Need
December 4, 2023
Authors: Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang
cs.AI
Abstract
We introduce Magicoder, a series of fully open-source (code, weights, and
data) Large Language Models (LLMs) for code that significantly closes the gap
with top code models while having no more than 7B parameters. Magicoder models
are trained on 75K synthetic instruction data using OSS-Instruct, a novel
approach to enlightening LLMs with open-source code snippets to generate
high-quality instruction data for code. Our main motivation is to mitigate the
inherent bias of the synthetic data generated by LLMs by empowering them with a
wealth of open-source references for the production of more diverse, realistic,
and controllable data. The orthogonality of OSS-Instruct and other data
generation methods like Evol-Instruct further enables us to build an enhanced
MagicoderS. Both Magicoder and MagicoderS substantially outperform
state-of-the-art code models with similar or even larger sizes on a wide range
of coding benchmarks, including Python text-to-code generation, multilingual
coding, and data-science program completion. Notably, MagicoderS-CL-7B based on
CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in
pass@1). Overall, OSS-Instruct opens a new direction for low-bias and
high-quality instruction tuning using abundant open-source references.
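The core idea of OSS-Instruct (seeding a teacher LLM with a real open-source code snippet and asking it to derive a new coding problem plus a solution) can be sketched as follows. This is a minimal illustration rather than the authors' released pipeline: the prompt wording, the oss_instruct_pair helper, and the use of the OpenAI chat API as the teacher model are assumptions made for the example.

```python
# Minimal sketch of the OSS-Instruct idea (illustrative, not the paper's exact
# pipeline): seed a teacher LLM with an open-source code snippet and ask it to
# invent a new programming problem plus a self-contained solution, producing
# one synthetic instruction-tuning pair.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt template; the paper's actual wording may differ.
PROMPT_TEMPLATE = """\
Gain inspiration from the following random code snippet to create a
high-quality programming problem and a correct, self-contained solution.

Code snippet for inspiration:
{snippet}

Present your output in two sections:
[Problem Description]
[Solution]
"""

def oss_instruct_pair(snippet: str, model: str = "gpt-3.5-turbo") -> str:
    """Generate one synthetic (instruction, response) sample from a seed snippet."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(snippet=snippet)}],
    )
    return response.choices[0].message.content

# Example: a short snippet as might be mined from an open-source repository.
seed = (
    "def rolling_mean(xs, k):\n"
    "    return [sum(xs[i:i+k]) / k for i in range(len(xs) - k + 1)]"
)
print(oss_instruct_pair(seed))
```

Repeating this over many mined snippets and fine-tuning a base code model on the resulting pairs is the high-level recipe the abstract describes; because each sample is grounded in a different real-world snippet, the generated data inherits the diversity of the open-source corpus rather than the biases of a single seed prompt.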