Magicoder: Source Code Is All You Need
December 4, 2023
作者: Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang
cs.AI
Abstract
We introduce Magicoder, a series of fully open-source (code, weights, and
data) Large Language Models (LLMs) for code that significantly closes the gap
with top code models while having no more than 7B parameters. Magicoder models
are trained on 75K synthetic instruction data using OSS-Instruct, a novel
approach to enlightening LLMs with open-source code snippets to generate
high-quality instruction data for code. Our main motivation is to mitigate the
inherent bias of the synthetic data generated by LLMs by empowering them with a
wealth of open-source references for the production of more diverse, realistic,
and controllable data. The orthogonality of OSS-Instruct and other data
generation methods like Evol-Instruct further enables us to build an enhanced
MagicoderS. Both Magicoder and MagicoderS substantially outperform
state-of-the-art code models with similar or even larger sizes on a wide range
of coding benchmarks, including Python text-to-code generation, multilingual
coding, and data-science program completion. Notably, MagicoderS-CL-7B based on
CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in
pass@1). Overall, OSS-Instruct opens a new direction for low-bias and
high-quality instruction tuning using abundant open-source references.
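To make OSS-Instruct concrete, the following is a minimal sketch of one generation step in the spirit of the description above: sample a random open-source code snippet and ask a teacher LLM to invent a new coding problem, with a solution, inspired by it. The prompt wording, the gpt-3.5-turbo teacher, and the OpenAI client are illustrative assumptions, not the paper's exact implementation.

    import random
    from openai import OpenAI  # assumption: any chat-completion client would work

    client = OpenAI()

    def oss_instruct_step(snippets: list[str]) -> str:
        # One OSS-Instruct step (sketch): open-source seed snippet -> new problem + solution.
        seed = random.choice(snippets)  # a real code snippet grounds the generated problem
        prompt = (
            "Gain inspiration from the following random code snippet to create "
            "a high-quality programming problem, then write a correct solution.\n\n"
            f"Code snippet for inspiration:\n{seed}"
        )
        response = client.chat.completions.create(  # hypothetical teacher-model call
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

Repeating this over many snippets yields diverse instruction-response pairs (75K in the paper) for fine-tuning; grounding each problem in a real snippet is the mechanism behind the bias-mitigation claim above.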
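For reference on the metric: pass@1 belongs to the pass@k family introduced with HumanEval. Below is a minimal sketch of the standard unbiased estimator, assuming n generated samples per problem of which c pass the tests (the abstract does not say whether Magicoder's numbers use greedy decoding or sampling).

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Probability that at least one of k samples, drawn without replacement
        # from the n generated, passes the tests, given c of the n are correct.
        if n - c < k:
            return 1.0  # every size-k draw must then contain a correct sample
        return 1.0 - comb(n - c, k) / comb(n, k)

For k = 1 this reduces to c / n, the fraction of correct samples, so a pass@1 of 66.5 means roughly two thirds of HumanEval+ problems solved on the first attempt.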