

StarCoder: may the source be with you!

May 9, 2023
作者: Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries
cs.AI

Abstract

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
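
The fast large-batch inference mentioned in the abstract comes from multi-query attention (MQA), in which all query heads share a single key/value head, so the per-token KV cache shrinks by a factor of the head count during decoding. Below is a minimal PyTorch sketch of the idea; the shapes and names are illustrative, not StarCoder's actual implementation:

```python
import math
import torch

def multi_query_attention(x, w_q, w_kv, n_heads):
    """MQA sketch: n_heads query heads share ONE key/value head, so the
    KV cache is n_heads times smaller than in standard multi-head attention."""
    batch, seq, d_model = x.shape
    head_dim = d_model // n_heads

    # Queries get a full set of heads: (batch, n_heads, seq, head_dim).
    q = (x @ w_q).view(batch, seq, n_heads, head_dim).transpose(1, 2)
    # Keys/values get a single shared head: (batch, 1, seq, head_dim).
    k, v = (x @ w_kv).split(head_dim, dim=-1)
    k, v = k.unsqueeze(1), v.unsqueeze(1)

    # Causal attention; the shared K/V head broadcasts across all query heads.
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    out = scores.softmax(dim=-1) @ v
    return out.transpose(1, 2).reshape(batch, seq, d_model)

# Toy usage: 8 query heads, one shared K/V head.
d_model, n_heads = 64, 8
x = torch.randn(2, 10, d_model)
w_q = torch.randn(d_model, d_model)
w_kv = torch.randn(d_model, 2 * (d_model // n_heads))
print(multi_query_attention(x, w_q, w_kv, n_heads).shape)  # torch.Size([2, 10, 64])
```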
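
The infilling capability is exposed through fill-in-the-middle (FIM) sentinel tokens. A hedged usage sketch with the Hugging Face transformers library follows; the checkpoint name and sentinel tokens are taken from the public BigCode release, so verify them against the model card before relying on them:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # public BigCode checkpoint (gated; requires accepting the license)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# FIM prompt: the model generates the code that belongs between prefix and suffix.
prefix = "def fibonacci(n):\n    "
suffix = "\n    return a"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```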
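
The reported 40% pass@1 on HumanEval refers to the standard pass@k metric, conventionally computed with the unbiased estimator of Chen et al. (2021). A short sketch with toy numbers, not the paper's actual evaluation harness:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n: samples generated per problem, c: samples passing the unit tests,
    k: evaluation budget. Returns the estimated probability that at
    least one of k samples passes."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy example: with 200 samples of which 80 pass, pass@1 reduces to c/n.
print(pass_at_k(200, 80, 1))  # 0.4, i.e. a 40% pass@1
```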