Predicting the Order of Upcoming Tokens Improves Language Modeling

Abstract

Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training, but its improvements are inconsistent and it underperforms on standard NLP benchmarks. We argue that MTP's exact prediction of future tokens is too difficult as an auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer, whereas MTP requires multiple additional transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using the NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP, even at scale. Our code is available at https://github.com/zaydzuhri/token-order-prediction
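
To make the idea of ordering upcoming tokens by proximity concrete, here is a minimal pure-Python sketch. The abstract does not specify the scoring scheme or the exact loss, so everything below is an assumption for illustration: the function names (`top_targets`, `listnet_loss`) are hypothetical, the proximity score is taken to decrease linearly with distance inside a fixed window, and the learning-to-rank loss is taken to be a ListNet-style listwise softmax cross-entropy. The released code linked above is the authoritative reference.

```python
import math

NEG_INF = float("-inf")


def top_targets(tokens, vocab_size, window):
    """Build a proximity score per vocabulary entry for every position.

    Assumption: a token whose first upcoming occurrence is d steps ahead
    (1 <= d <= window) gets score window - d + 1, so nearer tokens rank
    higher; tokens that do not occur within the window get -inf.
    """
    targets = []
    for t in range(len(tokens)):
        scores = [NEG_INF] * vocab_size
        for d in range(1, window + 1):
            if t + d >= len(tokens):
                break
            tok = tokens[t + d]
            if scores[tok] == NEG_INF:  # keep only the nearest occurrence
                scores[tok] = float(window - d + 1)
        targets.append(scores)
    return targets


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs):
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def listnet_loss(logits, target_scores):
    """ListNet-style listwise loss: cross-entropy between softmax(target
    scores) and softmax(logits). Requires at least one finite target score
    and finite logits."""
    p = softmax(target_scores)
    log_q = log_softmax(logits)
    return -sum(pi * lq for pi, lq in zip(p, log_q))


if __name__ == "__main__":
    tokens = [2, 0, 1, 0]  # toy token ids over a vocabulary of size 3
    targets = top_targets(tokens, vocab_size=3, window=2)
    # Hypothetical logits, standing in for the extra unembedding layer's output.
    logits = [2.0, 1.0, -5.0]
    print(targets[0], listnet_loss(logits, targets[0]))
```

In this sketch the final position has no upcoming tokens, so its target row is all `-inf` and would have to be masked out before computing the loss. When only one token falls inside the window, the loss reduces to ordinary cross-entropy on that token.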
