

Rho-1: Not All Tokens Are What You Need

April 11, 2024
作者: Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen
cs.AI

Abstract

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "not all tokens in a corpus are equally important for language model training". Our initial analysis delves into the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that are aligned with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher excess loss. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretrained on 80B general tokens, Rho-1 achieves an average improvement of 6.8% across 15 diverse tasks, increasing both the efficiency and performance of language model pre-training.
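The abstract describes SLM at a high level: score every pretraining token with a reference model, compute the "excess loss" of the model being trained relative to that reference, and back-propagate only through the highest-excess-loss tokens. Below is a minimal sketch of that idea, assuming HuggingFace-style causal LMs that return `.logits`; the `keep_ratio` value and model handles are illustrative assumptions, not settings reported in the paper.

```python
# Sketch of Selective Language Modeling (SLM): keep only tokens whose
# excess loss (training-model loss minus reference-model loss) is in the
# top fraction, and average the training loss over those tokens only.
import torch
import torch.nn.functional as F


def token_losses(logits, input_ids):
    """Per-token next-token cross-entropy; shape (batch, seq_len - 1)."""
    shift_logits = logits[:, :-1, :]   # position t predicts token t+1
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)


def slm_loss(train_model, ref_model, input_ids, keep_ratio=0.6):
    """Selective LM loss over the high-excess-loss tokens of a batch."""
    # Score tokens with the frozen reference model (no gradients needed).
    with torch.no_grad():
        ref_loss = token_losses(ref_model(input_ids).logits, input_ids)

    # Per-token loss of the model being trained (gradients flow here).
    train_loss = token_losses(train_model(input_ids).logits, input_ids)

    # Excess loss: tokens the current model still gets wrong but that the
    # reference model (trained on the desired distribution) finds learnable.
    excess = train_loss.detach() - ref_loss

    # Select the top `keep_ratio` fraction of tokens in the batch.
    k = max(1, int(excess.numel() * keep_ratio))
    threshold = excess.flatten().topk(k).values.min()
    mask = (excess >= threshold).float()

    # Focused loss: average only over the selected tokens.
    return (train_loss * mask).sum() / mask.sum().clamp(min=1.0)
```

In this sketch the selection threshold is recomputed per batch; the key design choice conveyed by the abstract is simply that masked-out tokens contribute nothing to the gradient, so training compute concentrates on tokens aligned with the target distribution.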

