Predicting the Order of Upcoming Tokens Improves Language Modeling
August 26, 2025
Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
cs.AI
Abstract
Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to
improve next-token prediction (NTP) in language model training but shows
inconsistent improvements, underperforming on standard NLP benchmarks. We argue
that MTP's exact future token prediction is too difficult as an auxiliary loss.
Instead, we propose Token Order Prediction (TOP), which trains models to order
upcoming tokens by their proximity using a learning-to-rank loss. TOP requires
only a single additional unembedding layer compared to MTP's multiple
transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using
NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show
that TOP overall outperforms both NTP and MTP even at scale. Our code is
available at https://github.com/zaydzuhri/token-order-prediction.
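The abstract describes TOP as adding a single extra unembedding head that is trained, with a learning-to-rank loss, to score upcoming tokens by how soon they appear. Below is a minimal PyTorch sketch of what such an auxiliary objective could look like. It is an illustration under stated assumptions, not the paper's released implementation: the window size, the proximity-score scheme (score = window - distance + 1), the ListNet-style softmax cross-entropy, and all names (`build_top_targets`, `top_head`, `top_loss`, `window`) are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def build_top_targets(tokens: torch.Tensor, vocab_size: int, window: int) -> torch.Tensor:
    # tokens: (B, T) integer ids. Returns (B, T, V) proximity scores:
    # at each position t, a vocab id that next occurs d steps ahead
    # (1 <= d <= window) gets score window - d + 1; ids absent from the
    # window get 0. Iterating far-to-near lets nearer occurrences
    # overwrite farther ones.
    B, T = tokens.shape
    targets = torch.zeros(B, T, vocab_size, device=tokens.device)
    for d in range(window, 0, -1):
        idx = tokens[:, d:].unsqueeze(-1)   # id seen d steps ahead: (B, T-d, 1)
        targets[:, : T - d].scatter_(2, idx, float(window - d + 1))
    return targets

def top_loss(hidden: torch.Tensor, top_head: torch.nn.Linear,
             tokens: torch.Tensor, window: int = 4) -> torch.Tensor:
    # hidden: (B, T, H) final hidden states; top_head: the single extra
    # unembedding layer, Linear(H, V). ListNet-style cross-entropy between
    # the head's ranking scores and the proximity-based target distribution.
    # (Positions near the sequence end have all-zero targets here, so their
    # target distribution degenerates to uniform in this sketch.)
    logits = top_head(hidden)                                     # (B, T, V)
    targets = build_top_targets(tokens, logits.size(-1), window)
    return -(F.softmax(targets, -1) * F.log_softmax(logits, -1)).sum(-1).mean()
```

In training, an auxiliary term like this would presumably be added to the standard NTP cross-entropy, e.g. `loss = ntp_loss + lam * top_loss(hidden, top_head, tokens)` with some weight `lam`, and the extra head discarded at inference so the deployed model is unchanged.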