

Rho-1: Not All Tokens Are What You Need

April 11, 2024
作者: Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen
cs.AI

Abstract

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis delves into the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher excess loss. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath while using only 3% of the pretraining tokens. Furthermore, when pretrained on 80B general tokens, Rho-1 achieves a 6.8% average improvement across 15 diverse tasks, increasing both the efficiency and performance of language model pre-training.
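To make the Selective Language Modeling (SLM) idea concrete, below is a minimal PyTorch sketch of the objective as described in the abstract: score each token by its excess loss (training-model loss minus reference-model loss) and backpropagate only through the highest-scoring fraction. The function name `selective_lm_loss` and the `keep_ratio` parameter are illustrative assumptions, not identifiers from the paper's released code.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, input_ids, keep_ratio=0.6):
    """Sketch of Selective Language Modeling (SLM).

    Tokens are scored by their "excess loss" (training model minus a frozen
    reference model); the next-token loss is averaged only over the top
    `keep_ratio` fraction of tokens by that score.
    """
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]          # (B, T-1, V)
    shift_ref = ref_logits[:, :-1, :]         # reference model logits (computed under no_grad)
    targets = input_ids[:, 1:]                # (B, T-1)

    # Per-token cross-entropy under the training model and the reference model.
    ce = F.cross_entropy(shift_logits.transpose(1, 2), targets, reduction="none")
    ref_ce = F.cross_entropy(shift_ref.transpose(1, 2), targets, reduction="none")

    # Excess loss scores tokens; detach so selection is not differentiated through.
    excess = (ce - ref_ce).detach()           # (B, T-1)

    # Keep the top-k tokens by excess loss; mask out the rest.
    k = max(1, int(keep_ratio * excess.numel()))
    threshold = torch.topk(excess.flatten(), k).values.min()
    mask = (excess >= threshold).float()

    # Average the training-model loss over the selected tokens only.
    return (ce * mask).sum() / mask.sum()
```

In practice the reference model's logits would be produced once per batch under `torch.no_grad()`, and the selection ratio is a tunable hyperparameter; the values used here are placeholders.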
