Multi-Token Prediction Needs Registers
May 15, 2025
Authors: Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis
cs.AI
Abstract
Multi-token prediction has emerged as a promising objective for improving
language model pretraining, but its benefits have not consistently generalized
to other settings such as fine-tuning. In this paper, we propose MuToR, a
simple and effective approach to multi-token prediction that interleaves
learnable register tokens into the input sequence, each tasked with predicting
future targets. Compared to existing methods, MuToR offers several key
advantages: it introduces only a negligible number of additional parameters,
requires no architectural changes--ensuring compatibility with off-the-shelf
pretrained language models--and remains aligned with the next-token pretraining
objective, making it especially well-suited for supervised fine-tuning.
Moreover, it naturally supports scalable prediction horizons. We demonstrate
the effectiveness and versatility of MuToR across a range of use cases,
including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and
pretraining, on challenging generative tasks in both language and vision
domains. Our code will be available at: https://github.com/nasosger/MuToR.
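
For readers who want a concrete picture of the interleaving idea, below is a minimal PyTorch-style sketch based only on the abstract's description; it is not the authors' implementation. The names (`MuToRSketch`, `backbone`, `lm_head`, `register_embed`, `offset`) and the exact placement of one register after every input token are assumptions, and the attention-mask handling that would keep registers from disturbing ordinary next-token prediction is omitted.

```python
# Minimal sketch of multi-token prediction via interleaved register tokens,
# based only on the abstract. All design details below are assumptions, not
# the paper's implementation: registers are inserted after every input token,
# `backbone` maps embeddings to hidden states, `lm_head` projects to the
# vocabulary, and a single `offset` controls how far ahead registers predict.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MuToRSketch(nn.Module):
    def __init__(self, backbone: nn.Module, lm_head: nn.Linear,
                 hidden_dim: int, offset: int = 2):
        super().__init__()
        self.backbone = backbone    # any decoder-style LM body: embeddings -> hidden states
        self.lm_head = lm_head      # shared projection to vocabulary logits
        self.offset = offset        # how many tokens ahead a register predicts
        # The only new parameter: one learnable register embedding.
        self.register_embed = nn.Parameter(0.02 * torch.randn(hidden_dim))

    def interleave(self, tok_embeds: torch.Tensor) -> torch.Tensor:
        """Insert one register embedding after every input token embedding."""
        b, t, d = tok_embeds.shape
        reg = self.register_embed.expand(b, t, d)
        return torch.stack([tok_embeds, reg], dim=2).reshape(b, 2 * t, d)

    def forward(self, tok_embeds: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(self.interleave(tok_embeds))
        next_logits = self.lm_head(hidden[:, 0::2])   # ordinary token positions
        reg_logits = self.lm_head(hidden[:, 1::2])    # register positions
        # Standard next-token loss on ordinary positions...
        loss_next = F.cross_entropy(
            next_logits[:, :-1].flatten(0, 1), labels[:, 1:].flatten())
        # ...plus an auxiliary loss where each register predicts a token
        # `offset` steps ahead of its position.
        loss_future = F.cross_entropy(
            reg_logits[:, :-self.offset].flatten(0, 1),
            labels[:, self.offset:].flatten())
        return loss_next + loss_future
```

In a sketch like this, the registers would simply be dropped at inference time, so decoding remains ordinary next-token generation and the pretrained model's behavior is left untouched; how the actual method handles this is detailed in the paper, not here.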