Predicting the Order of Upcoming Tokens Improves Language Modeling
August 26, 2025
Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
cs.AI
Abstract
Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to
improve next-token prediction (NTP) in language model training but shows
inconsistent improvements, underperforming in standard NLP benchmarks. We argue
that MTP's exact future token prediction is too difficult as an auxiliary loss.
Instead, we propose Token Order Prediction (TOP), which trains models to order
upcoming tokens by their proximity using a learning-to-rank loss. TOP requires
only a single additional unembedding layer compared to MTP's multiple
transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using
NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show
that TOP overall outperforms both NTP and MTP even at scale. Our code is
available at https://github.com/zaydzuhri/token-order-prediction.
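To make the objective concrete, here is a minimal PyTorch sketch of the idea described above: for each position, upcoming tokens within a window are assigned scores by proximity, and an extra unembedding head is trained with a listwise (ListNet-style) learning-to-rank loss against those scores. The target construction, window size, and exact ranking loss are assumptions for illustration; the paper's implementation may differ in detail.

```python
import torch
import torch.nn.functional as F

def top_targets(tokens: torch.Tensor, vocab_size: int, window: int) -> torch.Tensor:
    """Score each vocabulary item at each position by how soon it next appears.

    Closer upcoming tokens receive higher scores; tokens not appearing within
    the window score zero. (Hypothetical target construction for illustration.)
    """
    T = tokens.shape[0]
    targets = torch.zeros(T, vocab_size)
    for t in range(T):
        for k in range(1, min(window, T - 1 - t) + 1):
            tok = tokens[t + k].item()
            # If a token repeats within the window, keep its closest occurrence.
            targets[t, tok] = max(targets[t, tok].item(), float(window - k + 1))
    return targets

def top_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Listwise learning-to-rank loss (ListNet-style softmax cross-entropy).

    `logits` come from the single additional unembedding layer; `targets`
    are the proximity scores, normalized into a distribution.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    target_dist = F.softmax(targets, dim=-1)
    return -(target_dist * log_probs).sum(dim=-1).mean()

# Toy usage: sequence [1, 2, 3, 2] over a vocabulary of 5, window of 3.
tokens = torch.tensor([1, 2, 3, 2])
targets = top_targets(tokens, vocab_size=5, window=3)
loss = top_loss(torch.randn(4, 5), targets)
```

In training, this loss would be added as an auxiliary term alongside the standard NTP cross-entropy, sharing the trunk of the transformer and differing only in the extra unembedding head.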