Predicting the Order of Upcoming Tokens Improves Language Modeling

Abstract

Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training, but its gains are inconsistent and it often underperforms on standard NLP benchmarks. We argue that MTP's exact prediction of future tokens is too difficult as an auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer, compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using the NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP, even at scale. Our code is available at https://github.com/zaydzuhri/token-order-prediction
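
To make the idea of ordering upcoming tokens by proximity more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify how ranking targets are built or which learning-to-rank loss is used, so everything here is an assumption for illustration: the `window` parameter, the scoring rule (the nearest upcoming token gets the highest score), the ListNet-style listwise softmax cross-entropy, and the function names `proximity_targets` and `listwise_rank_loss`. The `logits` argument stands in for the output of the single additional unembedding layer at one position.

```python
import math

NEG_INF = float("-inf")


def proximity_targets(tokens, vocab_size, window):
    """Build one row of ranking scores per position (hypothetical scoring rule).

    For position t, a vocabulary id that next appears d steps ahead
    (1 <= d <= window) gets score window - d + 1, so nearer tokens score
    higher. Ids that do not appear within the window get -inf.
    """
    targets = []
    for t in range(len(tokens)):
        row = [NEG_INF] * vocab_size
        # Walk from the farthest offset back to the nearest so that the
        # nearest occurrence of a token overwrites any farther one.
        for d in range(min(window, len(tokens) - 1 - t), 0, -1):
            row[tokens[t + d]] = float(window - d + 1)
        targets.append(row)
    return targets


def log_softmax(xs):
    """Numerically stable log-softmax; -inf entries stay -inf."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def listwise_rank_loss(logits, target_scores):
    """ListNet-style loss (an assumed choice): cross-entropy between
    softmax(target_scores) and softmax(logits) for a single position."""
    if max(target_scores) == NEG_INF:
        return 0.0  # no upcoming tokens at this position, nothing to rank
    target_probs = [math.exp(v) for v in log_softmax(target_scores)]
    log_probs = log_softmax(logits)
    return -sum(p * lp for p, lp in zip(target_probs, log_probs) if p > 0.0)


# Toy usage: a 4-token sequence over a 3-word vocabulary, window of 2.
targets = proximity_targets([2, 0, 1, 0], vocab_size=3, window=2)
good = listwise_rank_loss([2.0, 1.0, -5.0], targets[0])  # agrees with the order
bad = listwise_rank_loss([-5.0, 1.0, 2.0], targets[0])   # reversed order
```

In this toy sequence, position 0 is followed by token 0 and then token 1, so its target row ranks token 0 above token 1 and excludes token 2. Logits that agree with that ordering (`good`) incur a lower loss than logits that reverse it (`bad`). In training, such a term would be added to the usual NTP loss as an auxiliary objective; how the two are weighted is not stated in the abstract.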
