Diffusion Language Models are Super Data Learners
November 5, 2025
Authors: Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh
cs.AI
Abstract
Under strictly controlled pre-training settings, we observe a Crossover: when
unique data is limited, diffusion language models (DLMs) consistently surpass
autoregressive (AR) models by training for more epochs. The crossover shifts
later with more or higher-quality data, earlier with larger models, and
persists across dense and sparse architectures. We attribute the gains to three
compounding factors: (1) any-order modeling, (2) super-dense compute from
iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation;
input or parameter noise improves AR models under data constraints but cannot close the
gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B
unique Python tokens overtakes an AR coder trained with strictly matched
settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag
and > 33% on MMLU using only 1B tokens, without any special tricks, just by
repeating standard pre-training data. We also show that rising validation
cross-entropy does not imply degraded downstream performance in this regime.
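The "built-in Monte Carlo augmentation" and "any-order modeling" claims can be made concrete by contrasting the two training views. This is a minimal illustrative sketch (not code from the paper): `diffusion_corrupt`, `ar_view`, and the `<mask>` token are hypothetical names, and the masking scheme shown (mask ratio sampled uniformly, positions masked independently) is one common masked-diffusion setup assumed for illustration.

```python
import random


def diffusion_corrupt(tokens, mask_id="<mask>", rng=random):
    """One masked-diffusion training view of a sequence.

    Sample a mask ratio t ~ U(0, 1), then mask each position
    independently with probability t. The model is supervised only at
    masked slots, predicting them from the full bidirectional context.
    Every epoch draws a fresh t and a fresh mask pattern, so repeated
    passes over the same data yield distinct training views.
    """
    t = rng.uniform(0.0, 1.0)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < t:
            inputs.append(mask_id)
            targets.append(tok)   # loss applied at this masked slot
        else:
            inputs.append(tok)
            targets.append(None)  # visible token: no loss
    return inputs, targets


def ar_view(tokens):
    """AR training view: shift-by-one next-token prediction.

    The (input, target) pair is identical on every epoch, so repeating
    limited data adds no new supervision signal.
    """
    return tokens[:-1], tokens[1:]


if __name__ == "__main__":
    seq = ["the", "cat", "sat", "on", "the", "mat"]
    # AR: one fixed view, no matter how many epochs.
    print(ar_view(seq))
    # Diffusion: each "epoch" samples a new mask ratio and pattern.
    for _ in range(3):
        print(diffusion_corrupt(seq))
```

Under this reading, each extra epoch gives the DLM a Monte Carlo sample over masking patterns (and hence over prediction orders), while the AR model re-sees the same left-to-right factorization, which is one way to interpret the crossover under limited unique data.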