Scaling Behavior of Discrete Diffusion Language Models
December 11, 2025
Authors: Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, Antonio Orvieto
cs.AI
Abstract
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs.
We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion, while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from that of ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data than masked diffusion for compute-efficient training, making it a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for 10^22 FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
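The abstract does not spell out how the interpolation between masked and uniform noise is parameterized. As a rough illustration only, one common way to bridge the two corruption processes is to mix an absorbing (mask) kernel and a uniform kernel with a single coefficient; the sketch below assumes such a mixture, and the names `corrupt`, `alpha_t`, `lam`, and `mask_id` are illustrative, not the paper's API.

```python
import numpy as np

def corrupt(x0, alpha_t, lam, vocab_size, mask_id, rng):
    """Sample x_t ~ q(x_t | x_0) for an interpolated discrete diffusion process.

    Illustrative assumption (not necessarily the paper's parameterization):
      q(x_t | x_0) = alpha_t * delta(x_0)
                     + (1 - alpha_t) * [lam * delta(mask) + (1 - lam) * Uniform(V)]
    lam = 1.0 recovers masked (absorbing) diffusion, lam = 0.0 uniform diffusion.
    """
    x0 = np.asarray(x0)
    keep = rng.random(x0.shape) < alpha_t          # tokens left uncorrupted
    to_mask = rng.random(x0.shape) < lam           # among corrupted tokens: mask vs. resample
    uniform = rng.integers(0, vocab_size, size=x0.shape)
    return np.where(keep, x0, np.where(to_mask, mask_id, uniform))

rng = np.random.default_rng(0)
tokens = rng.integers(0, 50_000, size=16)          # a toy token sequence
xt = corrupt(tokens, alpha_t=0.4, lam=0.5, vocab_size=50_000, mask_id=50_000, rng=rng)
print(xt)
```

Sweeping `lam` between 0 and 1 yields the family of noise types whose scaling behavior the paper compares.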
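The claim that uniform diffusion wants more parameters and less data at a fixed compute budget is a statement about where its compute-optimal frontier sits. A minimal sketch of reading such a frontier off a Chinchilla-style parametric fit L(N, D) = E + A·N^-alpha + B·D^-beta under the dense-transformer rule of thumb C ≈ 6ND is given below; both the functional form and the placeholder coefficients are assumptions for illustration, not the paper's fit.

```python
import numpy as np

# Hypothetical Chinchilla-style fit L(N, D) = E + A*N**-alpha + B*D**-beta.
# The coefficients below are placeholders, NOT values from the paper.
E, A, B, alpha, beta = 1.7, 400.0, 1200.0, 0.34, 0.28

def compute_optimal(C):
    """Minimize A*N**-alpha + B*D**-beta subject to C = 6*N*D."""
    # Setting d/dN [A*N**-alpha + B*(6*N/C)**beta] = 0 gives a closed form for N*.
    N = (alpha * A * C**beta / (beta * B * 6.0**beta)) ** (1.0 / (alpha + beta))
    D = C / (6.0 * N)
    return N, D

for C in [1e20, 1e21, 1e22]:
    N, D = compute_optimal(C)
    print(f"C={C:.0e}  N*={N:.2e} params  D*={D:.2e} tokens  tokens/param={D / N:.1f}")
```

Given fitted exponents and coefficients for each noise type, comparing the resulting N*(C) and D*(C) curves is what makes a statement like "more parameters and less data at the same compute" precise.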