Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Abstract

We study the transformation of autoregressive language models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention of an ARLM with bidirectional attention and trains the resulting model with a DLM objective. These approaches, however, incur two distribution shifts. First, switching from a next-token prediction objective to a DLM objective can discard knowledge the ARLM acquired during training. Second, standard DLMs suffer from a train-inference mismatch: the training loss is defined on randomly masked sequences, whereas inference follows the trajectories produced by confidence-based decoding. To address both challenges, we introduce the On-Policy Diffusion Language Model (OPDLM), which uses On-Policy Distillation (OPD) for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD: the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on those trajectories. Training on-policy eliminates the train-inference mismatch in DLMs, while distilling from the original model improves retention of the ARLM's knowledge. Empirically, OPDLM achieves strong performance across a wide variety of tasks while requiring 15x to 7,000x fewer training tokens. OPDLM thus avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.
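
The self-OPD loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: `toy_student` and `toy_teacher` are hypothetical stand-ins for the bidirectional student and the frozen causal teacher, one token is unmasked per step, the teacher is scored causally on the student's finished sequence, and the loss is taken to be the reverse KL divergence KL(student || teacher) averaged over masked positions. The abstract does not specify the loss or how teacher logits are aligned with intermediate states.

```python
import math

MASK = -1   # hypothetical mask token id
VOCAB = 4   # toy vocabulary size


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two probability vectors with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def toy_student(seq):
    """Hypothetical bidirectional student: logits for every position,
    computed from the whole (partially masked) sequence."""
    visible = sum(t for t in seq if t != MASK)
    return [[math.cos(0.7 * (pos + 1) * (tok + 1) + 0.3 * visible)
             for tok in range(VOCAB)]
            for pos in range(len(seq))]


def toy_teacher(prefix):
    """Hypothetical frozen causal teacher: next-token logits from the prefix only."""
    s = sum(prefix)
    return [math.sin(0.5 * (tok + 1) * (len(prefix) + 1) + 0.2 * s)
            for tok in range(VOCAB)]


def confidence_decode(student, length):
    """Confidence-based decoding: start fully masked and, at each step,
    fill the masked position whose top probability is highest."""
    seq = [MASK] * length
    trajectory = []
    while MASK in seq:
        trajectory.append(list(seq))  # record the state the student saw
        probs = [softmax(row) for row in student(seq)]
        masked = [i for i, t in enumerate(seq) if t == MASK]
        pos = max(masked, key=lambda i: max(probs[i]))
        seq[pos] = probs[pos].index(max(probs[pos]))
    return seq, trajectory


def self_opd_loss(student, teacher, length):
    """Average KL(student || teacher) over masked positions of every
    state on the student's own trajectory (assumed loss form)."""
    final, trajectory = confidence_decode(student, length)
    # Teacher targets: causal next-token distributions on the finished sequence.
    teacher_probs = [softmax(teacher(final[:i])) for i in range(length)]
    total, count = 0.0, 0
    for state in trajectory:
        student_probs = [softmax(row) for row in student(state)]
        for i, tok in enumerate(state):
            if tok == MASK:
                total += kl(student_probs[i], teacher_probs[i])
                count += 1
    return total / count, final


loss, final = self_opd_loss(toy_student, toy_teacher, length=5)
print(final, loss)
```

In an actual training setup, this loss would be backpropagated through the student only, with the teacher kept frozen; the toy functions here have no parameters, so the sketch stops at computing the loss.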