MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models

Abstract

Masked diffusion models (MDM) exhibit superior generalization when trained with a partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, there are no tools to guide the choice of the token-granularity hyperparameter in the subtokenizer. Second, we find that the functional form of the subtokenizer significantly degrades likelihood estimation when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model that incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM-Prime-v2 is 21.8 times more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves a perplexity of 7.77 on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When the model is scaled to 1.1B parameters, it further demonstrates superior zero-shot accuracy on various commonsense reasoning tasks.
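To make the sub-tokenization idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that Binary Encoding means writing each token index as a fixed-length string of bits, and that Index Shuffling means applying a fixed random permutation to the vocabulary indices before encoding. The names `make_subtokenizer` and `partially_mask`, the seed, and the vocabulary size 50257 (the GPT-2 BPE vocabulary size, used only as an illustrative value) are all hypothetical choices that do not come from the abstract.

```python
import random

MASK = None  # placeholder for a masked sub-token (illustrative choice)


def make_subtokenizer(vocab_size, seed=0):
    """Build encode/decode functions between token ids and bit tuples.

    Assumption: index shuffling is a fixed random permutation of token ids,
    and binary encoding is the base-2 expansion of the shuffled id.
    """
    # Number of bits needed to represent ids 0 .. vocab_size - 1.
    num_bits = max(1, (vocab_size - 1).bit_length())

    # Fixed permutation of the vocabulary indices and its inverse.
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)
    inverse = [0] * vocab_size
    for original, shuffled in enumerate(perm):
        inverse[shuffled] = original

    def encode(token_id):
        shuffled = perm[token_id]
        # Most significant bit first.
        return tuple((shuffled >> i) & 1 for i in reversed(range(num_bits)))

    def decode(bits):
        shuffled = 0
        for bit in bits:
            shuffled = (shuffled << 1) | bit
        return inverse[shuffled]

    return encode, decode, num_bits


def partially_mask(bits, mask_prob, rng):
    """Mask each sub-token independently, so a token can be partly visible."""
    return tuple(MASK if rng.random() < mask_prob else b for b in bits)


encode, decode, num_bits = make_subtokenizer(vocab_size=50257)
bits = encode(1234)
noisy = partially_mask(bits, mask_prob=0.5, rng=random.Random(1))
```

Under these assumptions each token becomes `num_bits` binary sub-tokens, and masking acts on individual bits rather than on whole tokens, which is what allows intermediate, partially masked states.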
