Predicting the Order of Upcoming Tokens Improves Language Modeling
August 26, 2025
Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
cs.AI
Abstract
Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to
improve next-token prediction (NTP) in language model training but shows
inconsistent improvements, underperforming in standard NLP benchmarks. We argue
that MTP's exact future token prediction is too difficult as an auxiliary loss.
Instead, we propose Token Order Prediction (TOP), which trains models to order
upcoming tokens by their proximity using a learning-to-rank loss. TOP requires
only a single additional unembedding layer compared to MTP's multiple
transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using
NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show
that TOP overall outperforms both NTP and MTP even at scale. Our code is
available at https://github.com/zaydzuhri/token-order-prediction.
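To make the objective concrete, here is a minimal PyTorch sketch of the idea described above: for each position, upcoming tokens within a window are assigned scores by proximity, and an extra unembedding head is trained with a listwise (ListNet-style) learning-to-rank loss against those scores. The target construction, window size, and exact ranking loss are assumptions for illustration; the paper's implementation may differ in detail.

```python
import torch
import torch.nn.functional as F

def top_targets(tokens: torch.Tensor, vocab_size: int, window: int) -> torch.Tensor:
    """Score each vocabulary item at each position by how soon it next appears.

    Closer upcoming tokens receive higher scores; tokens not appearing within
    the window score zero. (Hypothetical target construction for illustration.)
    """
    T = tokens.shape[0]
    targets = torch.zeros(T, vocab_size)
    for t in range(T):
        for k in range(1, min(window, T - 1 - t) + 1):
            tok = tokens[t + k].item()
            # If a token repeats within the window, keep its closest occurrence.
            targets[t, tok] = max(targets[t, tok].item(), float(window - k + 1))
    return targets

def top_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Listwise learning-to-rank loss (ListNet-style softmax cross-entropy).

    `logits` come from the single additional unembedding layer; `targets`
    are the proximity scores, normalized into a distribution.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    target_dist = F.softmax(targets, dim=-1)
    return -(target_dist * log_probs).sum(dim=-1).mean()

# Toy usage: sequence [1, 2, 3, 2] over a vocabulary of 5, window of 3.
tokens = torch.tensor([1, 2, 3, 2])
targets = top_targets(tokens, vocab_size=5, window=3)
loss = top_loss(torch.randn(4, 5), targets)
```

In training, this loss would be added as an auxiliary term alongside the standard NTP cross-entropy, sharing the trunk of the transformer and differing only in the extra unembedding head.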