Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Abstract

We study the transformation of autoregressive language models (ARLMs) into diffusion language models (DLMs). Rather than pretraining a DLM from scratch, prior work replaces the causal attention in an ARLM with bidirectional attention and then trains the resulting model with a DLM objective. However, this approach incurs two distribution shifts. First, switching from a next-token prediction objective to a DLM objective can discard knowledge the ARLM acquired during training. Second, standard DLMs suffer from a train-inference mismatch: the training loss is defined on randomly masked sequences, not on the trajectories that confidence-based decoding produces at inference. To address both challenges, we introduce the On-Policy Diffusion Language Model (OPDLM), which uses On-Policy Distillation (OPD) for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD: the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. Training directly on-policy eliminates the train-inference mismatch in DLMs, while distilling from the original model improves retention of the ARLM's knowledge. Empirical results show that OPDLM requires 15x to 7,000x fewer training tokens while achieving strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.
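The self-OPD loop described above can be illustrated with a small toy sketch. It is only one possible reading of the abstract, not the authors' implementation. `toy_student` and `toy_teacher` are hypothetical stand-ins for the bidirectional student and the frozen causal teacher. The sketch further assumes that the teacher scores the student's completed sequence, that the loss is taken at the positions still masked in each visited state, and that the divergence is KL(teacher || student); the abstract specifies none of these details.

```python
import math

MASK = None                # placeholder for a masked position
VOCAB = ["a", "b", "c"]    # toy vocabulary (assumption)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_student(state):
    """Hypothetical bidirectional student: every position sees the whole state."""
    n_visible = sum(tok is not MASK for tok in state)
    return [
        softmax([math.sin(i + j + n_visible) for j in range(len(VOCAB))])
        for i in range(len(state))
    ]


def toy_teacher(tokens):
    """Hypothetical frozen causal teacher: position i sees only tokens[:i]."""
    dists = []
    for i in range(len(tokens)):
        prefix = sum(VOCAB.index(t) for t in tokens[:i])
        dists.append(softmax([math.cos(prefix + j) for j in range(len(VOCAB))]))
    return dists


def confidence_decode(student, length):
    """Unmask one position per step, always the one the student is most sure of.

    Returns the completed sequence and the list of partially masked states
    visited on the way (the on-policy trajectory).
    """
    state = [MASK] * length
    trajectory = []
    while MASK in state:
        trajectory.append(list(state))
        probs = student(state)
        masked = [i for i, tok in enumerate(state) if tok is MASK]
        best = max(masked, key=lambda i: max(probs[i]))
        state[best] = VOCAB[probs[best].index(max(probs[best]))]
    return state, trajectory


def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def on_policy_distill_loss(student, teacher, length):
    """Average KL between teacher targets and student predictions on the
    states the student itself visited during decoding."""
    final, trajectory = confidence_decode(student, length)
    targets = teacher(final)   # teacher logits on the student's own sample
    total, count = 0.0, 0
    for state in trajectory:
        probs = student(state)
        for i, tok in enumerate(state):
            if tok is MASK:    # only positions not yet decoded (assumption)
                total += kl(targets[i], probs[i])
                count += 1
    return total / count


if __name__ == "__main__":
    final, trajectory = confidence_decode(toy_student, 4)
    print(final, len(trajectory))
    print(on_policy_distill_loss(toy_student, toy_teacher, 4))
```

In an actual training setup the loss would be backpropagated through the student only, since the teacher is frozen. The point of the sketch is that the masked states come from the student's own confidence-based decoding rather than from random masking, which is the mismatch the abstract describes.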