MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models
March 17, 2026
Authors: Chen-Hao Chao, Wei-Fang Sun, Junwei Qua, Chun-Yi Lee, Rahul G. Krishnan
cs.AI
Abstract
Masked diffusion models (MDM) exhibit superior generalization when trained with a partial masking scheme (Prime), which converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, there are no tools to guide the choice of the token-granularity hyperparameter in the sub-tokenizer. Second, the functional form of the sub-tokenizer significantly degrades likelihood estimation when paired with commonly used byte-pair encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model that incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM-Prime-v2 is 21.8× more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves a perplexity of 7.77 on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When the model is scaled to 1.1B parameters, it further demonstrates superior zero-shot accuracy on a range of commonsense reasoning tasks.
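The two components named in the title can be illustrated concretely. The following is a minimal sketch, not the paper's implementation: it assumes "binary encoding" means splitting each token id into its binary digits as sub-tokens, and "index shuffling" means applying a fixed permutation to token ids before encoding. All function names and parameters here are hypothetical.

```python
import random

def binary_encode(token_id: int, num_bits: int) -> list[int]:
    """Split a token id into binary sub-tokens (most significant bit first)."""
    return [(token_id >> i) & 1 for i in reversed(range(num_bits))]

def binary_decode(bits: list[int]) -> int:
    """Recombine binary sub-tokens into the original token id."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out

def make_shuffle(vocab_size: int, seed: int = 0) -> list[int]:
    """A fixed random permutation of token ids (one reading of 'index shuffling')."""
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)
    return perm

# Example: a 1024-word vocabulary maps onto 10 binary sub-tokens (2**10 = 1024).
vocab_size, num_bits = 1024, 10
perm = make_shuffle(vocab_size)
token_id = 42
shuffled = perm[token_id]          # shuffle the index first ...
sub_tokens = binary_encode(shuffled, num_bits)  # ... then encode to sub-tokens
assert binary_decode(sub_tokens) == shuffled    # the mapping is invertible
```

The diffusion process would then mask and denoise the `sub_tokens` sequence rather than whole-token ids; the permutation decouples sub-token patterns from whatever ordering the BPE tokenizer happened to assign.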