Predicting the Order of Upcoming Tokens Improves Language Modeling
August 26, 2025
Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
cs.AI
Abstract
Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to
improve next-token prediction (NTP) in language model training but shows
inconsistent improvements, underperforming on standard NLP benchmarks. We argue
that MTP's exact future token prediction is too difficult as an auxiliary loss.
Instead, we propose Token Order Prediction (TOP), which trains models to order
upcoming tokens by their proximity using a learning-to-rank loss. TOP requires
only a single additional unembedding layer compared to MTP's multiple
transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using
NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show
that TOP overall outperforms both NTP and MTP even at scale. Our code is
available at https://github.com/zaydzuhri/token-order-prediction.
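The abstract describes TOP as adding a single extra unembedding head that is trained, with a learning-to-rank loss, to score upcoming tokens by how soon they appear. Below is a minimal PyTorch sketch of what such an auxiliary objective could look like. It is an illustration under stated assumptions, not the paper's released implementation: the window size, the proximity-score scheme (score = window - distance + 1), the ListNet-style softmax cross-entropy, and all names (`build_top_targets`, `top_head`, `top_loss`, `window`) are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def build_top_targets(tokens: torch.Tensor, vocab_size: int, window: int) -> torch.Tensor:
    # tokens: (B, T) integer ids. Returns (B, T, V) proximity scores:
    # at each position t, a vocab id that next occurs d steps ahead
    # (1 <= d <= window) gets score window - d + 1; ids absent from the
    # window get 0. Iterating far-to-near lets nearer occurrences
    # overwrite farther ones.
    B, T = tokens.shape
    targets = torch.zeros(B, T, vocab_size, device=tokens.device)
    for d in range(window, 0, -1):
        idx = tokens[:, d:].unsqueeze(-1)   # id seen d steps ahead: (B, T-d, 1)
        targets[:, : T - d].scatter_(2, idx, float(window - d + 1))
    return targets

def top_loss(hidden: torch.Tensor, top_head: torch.nn.Linear,
             tokens: torch.Tensor, window: int = 4) -> torch.Tensor:
    # hidden: (B, T, H) final hidden states; top_head: the single extra
    # unembedding layer, Linear(H, V). ListNet-style cross-entropy between
    # the head's ranking scores and the proximity-based target distribution.
    # (Positions near the sequence end have all-zero targets here, so their
    # target distribution degenerates to uniform in this sketch.)
    logits = top_head(hidden)                                     # (B, T, V)
    targets = build_top_targets(tokens, logits.size(-1), window)
    return -(F.softmax(targets, -1) * F.log_softmax(logits, -1)).sum(-1).mean()
```

In training, an auxiliary term like this would presumably be added to the standard NTP cross-entropy, e.g. `loss = ntp_loss + lam * top_loss(hidden, top_head, tokens)` with some weight `lam`, and the extra head discarded at inference so the deployed model is unchanged.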