MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-Optimal Scaling of Diffusion Language Models
March 17, 2026
Authors: Chen-Hao Chao, Wei-Fang Sun, Junwei Qua, Chun-Yi Lee, Rahul G. Krishnan
cs.AI
Abstract
Masked diffusion models (MDM) exhibit superior generalization when learned using a partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we lack tools to guide the choice of the token-granularity hyperparameter in the sub-tokenizer. Second, we find that the functional form of the sub-tokenizer significantly degrades likelihood estimation when paired with commonly used Byte-Pair Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model that incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM-Prime-v2 is 21.8 times more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves 7.77 perplexity on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When extending the model size to 1.1B parameters, our model further demonstrates superior zero-shot accuracy on various commonsense reasoning tasks.
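The abstract describes encoding each BPE token as a sequence of binary sub-tokens after remapping token indices through a permutation. Below is a minimal illustrative sketch of that idea, assuming a fixed-length binary code per token id and a random permutation for the index-shuffling step; all names and choices here are ours, not the paper's implementation.

```python
import random

# Illustrative vocabulary size (e.g. GPT-2's BPE vocabulary).
VOCAB_SIZE = 50257
# Bits needed to represent any token id -> each token becomes this many
# binary sub-tokens.
NUM_BITS = (VOCAB_SIZE - 1).bit_length()  # 16 for a ~50k vocabulary

# Index shuffling (assumed form): remap token ids through a random
# permutation before binary encoding, so nearby BPE ids do not share
# bit patterns systematically.
rng = random.Random(0)
perm = list(range(VOCAB_SIZE))
rng.shuffle(perm)
inv_perm = {shuffled: original for original, shuffled in enumerate(perm)}

def encode(token_id: int) -> list[int]:
    """Map a token id to NUM_BITS binary sub-tokens (most significant first)."""
    s = perm[token_id]
    return [(s >> b) & 1 for b in reversed(range(NUM_BITS))]

def decode(bits: list[int]) -> int:
    """Invert the binary encoding back to the original token id."""
    s = 0
    for b in bits:
        s = (s << 1) | b
    return inv_perm[s]

# Round-trip check: every token id survives encode -> decode.
assert decode(encode(1234)) == 1234
assert len(encode(1234)) == NUM_BITS
```

In a masked diffusion model over sub-tokens, the diffusion process would then mask and denoise these bit positions individually rather than whole tokens; the sketch covers only the encoding, not the diffusion training itself.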