Multi-Token Prediction Needs Registers

Abstract

Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters; it requires no architectural changes, which ensures compatibility with off-the-shelf pretrained language models; and it remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at https://github.com/nasosger/MuToR.
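
The core idea of interleaving register tokens can be illustrated with a minimal sketch. The abstract does not specify where registers are placed, how far ahead each one predicts, or how attention and position handling work, so everything below is an assumption made for illustration only: the placeholder `REGISTER`, the helper `interleave_registers`, the choice to insert one register after every token, and the fixed `offset` are all hypothetical and are not taken from the MuToR code.

```python
from typing import List, Optional, Tuple

# Hypothetical placeholder standing in for a learnable register token.
REGISTER = "<reg>"


def interleave_registers(
    tokens: List[str], offset: int = 2
) -> Tuple[List[str], List[Optional[str]]]:
    """Insert a register after every token and build per-position targets.

    Assumed scheme (not from the paper): regular tokens keep the usual
    next-token target, and the register inserted after tokens[i] is
    assigned tokens[i + offset] as its target. Positions without a valid
    target get None, meaning they would be ignored in the loss.
    """
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    n = len(tokens)
    for i, tok in enumerate(tokens):
        # Regular token: standard next-token prediction target.
        inputs.append(tok)
        targets.append(tokens[i + 1] if i + 1 < n else None)
        # Register token: target lies `offset` positions ahead of tokens[i].
        inputs.append(REGISTER)
        targets.append(tokens[i + offset] if i + offset < n else None)
    return inputs, targets


if __name__ == "__main__":
    seq = ["the", "cat", "sat", "down"]
    inputs, targets = interleave_registers(seq, offset=2)
    for inp, tgt in zip(inputs, targets):
        print(f"{inp:>6} -> {tgt}")
```

In this sketch the regular tokens are trained exactly as in next-token prediction, which is one way to read the abstract's claim that the method stays aligned with the pretraining objective. Changing `offset` changes how far ahead each register looks, which loosely corresponds to a configurable prediction horizon.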
