Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
September 3, 2024
Authors: Yuxiang Wei, Hojae Han, Rajhans Samdani
cs.AI
Abstract
Recent studies have been increasingly demonstrating that high-quality data is
crucial for effective pretraining of language models. However, the precise
definition of "high-quality" remains underexplored. Focusing on the code
domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model
pretrained on 555B tokens through three phases of progressively refined data:
(1) general pretraining with 500B standard-quality code tokens, preprocessed
through basic filtering, deduplication, and decontamination, (2) continued
pretraining with 50B high-quality tokens, selected from phase one by a
BERT-style quality annotator trained to distinguish good code from random data,
using positive examples drawn from high-quality code files, along with
instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced
pretraining with 5B synthetic data created by Llama-3.1-70B using phase two
data as seeds, adapting the Magicoder approach for pretraining. Despite being
trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art
performance on BigCodeBench, a coding benchmark focusing on practical and
challenging programming tasks, compared to similarly sized models trained on no
more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated
benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T
tokens. Additionally, it matches the performance of leading small base code
models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B
surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a
benchmark that evaluates function-level code generation, and remains
competitive on BigCodeBench. Our evaluation presents a comprehensive analysis
justifying various design choices for Arctic-SnowCoder. Most importantly, we
find that the key to high-quality data is its alignment with the distribution
of downstream applications.
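
To make the phase-two selection step described above concrete, the following is a minimal sketch (not the authors' released code) of how a BERT-style quality annotator could score code files and keep only the high-scoring ones for continued pretraining. The checkpoint name, score threshold, and helper function are hypothetical placeholders.

```python
# Hedged sketch: filtering code files with a BERT-style quality classifier.
# "example-org/code-quality-annotator" is a hypothetical placeholder model name,
# not the paper's actual annotator checkpoint.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "example-org/code-quality-annotator"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def quality_score(code: str) -> float:
    """Return the probability that `code` belongs to the 'good code' class
    (class index 1 assumed)."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def select_high_quality(files: list[str], threshold: float = 0.5) -> list[str]:
    """Keep files whose quality score clears the (hypothetical) threshold,
    e.g. until a 50B-token budget for continued pretraining is reached."""
    return [f for f in files if quality_score(f) >= threshold]
```

In this reading of the abstract, the annotator is trained with high-quality code files and instruction data (Magicoder, StarCoder2-Instruct) as positives against random phase-one data as negatives, so the score acts as a proxy for alignment with the downstream distribution rather than a generic notion of code quality.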