Multi-Token Prediction Needs Registers

Abstract

Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters; it requires no architectural changes, ensuring compatibility with off-the-shelf pretrained language models; and it remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
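
As a rough illustration of the interleaving idea, the following sketch shows one way register tokens could be placed into a token sequence and paired with future targets. It is not taken from the MuToR code. The placeholder id `REGISTER_ID`, the function name `interleave_registers`, the choice of one register after every regular token, and the fixed `offset` are all assumptions made for this example. The sketch covers only how inputs and targets are built; it does not model how registers are embedded, attended to, or handled at inference, since the abstract does not specify these details.

```python
from typing import List, Optional, Tuple

# Hypothetical placeholder id standing in for the learnable register token.
REGISTER_ID = -1


def interleave_registers(
    tokens: List[int], offset: int = 2, register_id: int = REGISTER_ID
) -> Tuple[List[int], List[Optional[int]]]:
    """Insert one register after each regular token and build per-position targets.

    Assumed convention (an illustration, not the paper's specification):
      - a regular token at original index i targets tokens[i + 1] (next-token);
      - the register placed right after it targets tokens[i + offset].
    Positions whose target would fall past the end get None (no loss term).
    """
    if offset < 1:
        raise ValueError("offset must be >= 1")

    inputs: List[int] = []
    targets: List[Optional[int]] = []
    n = len(tokens)
    for i, tok in enumerate(tokens):
        # Regular token: standard next-token target.
        inputs.append(tok)
        targets.append(tokens[i + 1] if i + 1 < n else None)
        # Register token: target lies `offset` steps ahead of the preceding token.
        inputs.append(register_id)
        targets.append(tokens[i + offset] if i + offset < n else None)
    return inputs, targets


inputs, targets = interleave_registers([10, 11, 12, 13], offset=2)
print(inputs)   # [10, -1, 11, -1, 12, -1, 13, -1]
print(targets)  # [11, 12, 12, 13, 13, None, None, None]
```

In this toy setup the regular positions keep their usual next-token targets, which is one way to read the abstract's claim that the approach stays aligned with the next-token objective. Changing `offset` changes how far ahead each register looks without changing the model itself.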
