StarCoder 2 and The Stack v2: The Next Generation

February 29, 2024
Authors: Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries
cs.AI

Abstract

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
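
The abstract's last sentence points to Software Heritage persistent identifiers (SWHIDs) as the mechanism for training-data transparency. As a hedged illustration of what such an identifier looks like, the sketch below computes the core SWHID of a single file (a SWH "content" object), which reuses git's blob hashing; the helper name `swhid_for_content` and the sample bytes are illustrative, not taken from the paper.

```python
# Hedged sketch: computing the core SWHID ("swh:1:cnt:<sha1_git>") of one file.
# Content SWHIDs reuse git-compatible blob hashing: sha1 over "blob <size>\0" + bytes.
import hashlib

def swhid_for_content(data: bytes) -> str:
    """Return the core SWHID for a raw file's bytes (a Software Heritage 'content' object)."""
    header = f"blob {len(data)}\0".encode()
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

print(swhid_for_content(b"print('hello world')\n"))
# -> swh:1:cnt:<40 hex characters> identifying this exact file content
```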
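The weights themselves are released under an OpenRAIL license. A minimal sketch of loading and prompting one of the checkpoints with Hugging Face `transformers` is shown below; the Hub repository id `bigcode/starcoder2-3b` and the example prompt are assumptions, not details given in the abstract.

```python
# Minimal sketch (assumed Hub id "bigcode/starcoder2-3b"; the 7B and 15B
# checkpoints would be loaded the same way). Requires transformers and torch.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # assumption: repository id on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Greedy completion of a short code prompt.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```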