Diffusion Language Models are Super Data Learners

Abstract

Under strictly controlled pre-training settings, we observe a crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover point shifts later with more or higher-quality data and earlier with larger models, and it persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation. Input or parameter noise improves AR models under data constraints but cannot close the gap. At scale, a 1.7B-parameter DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves >56% accuracy on HellaSwag and >33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.
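To make the "built-in Monte Carlo augmentation" factor and the repetition arithmetic concrete, here is a minimal sketch. It assumes a masked (absorbing-state) diffusion corruption, which is a common choice for DLMs but is not specified in the abstract; the `MASK` symbol and the `corrupt` and `epochs_over_unique_data` helpers are hypothetical names introduced for illustration only. The point is that each pass over the same sequence draws a fresh masking ratio and mask pattern, so repeated epochs present different prediction problems, whereas an AR model sees the same left-to-right factorization every time.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def corrupt(tokens, rng):
    """Sample a masking ratio t ~ U(0, 1), then mask each token
    independently with probability t (assumed masked-diffusion corruption)."""
    t = rng.random()
    noisy = [MASK if rng.random() < t else tok for tok in tokens]
    return t, noisy


def epochs_over_unique_data(total_tokens, unique_tokens):
    """Number of passes implied by a token budget over a fixed unique corpus."""
    return total_tokens / unique_tokens


rng = random.Random(0)
tokens = "def add ( a , b ) : return a + b".split()

# The same sequence yields a different corrupted view on every epoch.
for epoch in range(3):
    t, noisy = corrupt(tokens, rng)
    print(f"epoch {epoch}: t={t:.2f}  {' '.join(noisy)}")

# Reading the ~1.5T-token budget as tokens processed over 10B unique tokens:
print(epochs_over_unique_data(1_500_000_000_000, 10_000_000_000))
```

If the ~1.5T-token compute budget is read as tokens processed, the 10B unique Python tokens are repeated on the order of 150 times, which is the heavily data-constrained regime the abstract describes.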
