Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
September 3, 2024
作者: Yuxiang Wei, Hojae Han, Rajhans Samdani
cs.AI
Abstract
Recent studies have been increasingly demonstrating that high-quality data is
crucial for effective pretraining of language models. However, the precise
definition of "high-quality" remains underexplored. Focusing on the code
domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model
pretrained on 555B tokens through three phases of progressively refined data:
(1) general pretraining with 500B standard-quality code tokens, preprocessed
through basic filtering, deduplication, and decontamination, (2) continued
pretraining with 50B high-quality tokens, selected from phase one by a
BERT-style quality annotator trained to distinguish good code from random data,
using positive examples drawn from high-quality code files, along with
instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced
pretraining with 5B synthetic data created by Llama-3.1-70B using phase two
data as seeds, adapting the Magicoder approach for pretraining. Despite being
trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art
performance on BigCodeBench, a coding benchmark focusing on practical and
challenging programming tasks, compared to similarly sized models trained on no
more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated
benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T
tokens. Additionally, it matches the performance of leading small base code
models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B
surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a
benchmark that evaluates function-level code generation, and remains
competitive on BigCodeBench. Our evaluation presents a comprehensive analysis
justifying various design choices for Arctic-SnowCoder. Most importantly, we
find that the key to high-quality data is its alignment with the distribution
of downstream applications.
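As a rough illustration of the phase-two selection step described in the abstract, the sketch below scores code files with a BERT-style binary quality classifier and greedily keeps the highest-scoring files until a token budget is reached. The checkpoint name, helper functions, and budget here are illustrative placeholders, not the paper's actual annotator or pipeline.

```python
# Minimal sketch of phase-two data selection: score phase-one code files with a
# BERT-style quality classifier and keep the best-scoring files until a token
# budget is spent. Checkpoint and budget are placeholders; the paper trains its
# own annotator on high-quality code and instruction data as positives.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder for a trained quality annotator
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def quality_score(code: str) -> float:
    """Probability that `code` is high-quality (label 1 of the binary annotator)."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def select_top_files(files, token_budget):
    """Greedily keep the best-scoring files until the token budget is exhausted."""
    scored = sorted(files, key=lambda f: quality_score(f["text"]), reverse=True)
    selected, used = [], 0
    for f in scored:
        n_tokens = len(tokenizer(f["text"], truncation=False)["input_ids"])
        if used + n_tokens > token_budget:
            break
        selected.append(f)
        used += n_tokens
    return selected

# Tiny in-memory example; in practice the corpus is the 500B-token phase-one data.
corpus = [{"text": "def add(a, b):\n    return a + b\n"},
          {"text": "x=1;y=2;print x"}]
subset = select_top_files(corpus, token_budget=10_000)
```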
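Phase three adapts the Magicoder (OSS-Instruct) idea to pretraining by prompting Llama-3.1-70B with high-quality phase-two snippets as seeds. The sketch below assumes an OpenAI-compatible inference endpoint serving the model; the endpoint URL, model identifier, and prompt wording are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of phase-three synthetic data generation: each high-quality
# phase-two snippet seeds one new synthetic code document, in the spirit of
# Magicoder's OSS-Instruct but producing plain files for pretraining.
# Assumes an OpenAI-compatible endpoint; names and prompt are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT_TEMPLATE = (
    "You are given a snippet drawn from a high-quality code file.\n"
    "Write a new, self-contained, well-documented source file inspired by it.\n\n"
    "Seed snippet:\n{seed}\n\nNew file:\n"
)

def generate_synthetic_doc(seed_code: str) -> str:
    """Turn one seed snippet into one synthetic pretraining document."""
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model id
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(seed=seed_code)}],
        temperature=0.8,
        max_tokens=2048,
    )
    return response.choices[0].message.content

seed = "def fibonacci(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n"
print(generate_synthetic_doc(seed))
```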