Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step

Abstract

Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored to MDLMs remain underexplored. A naive approach is to transfer techniques that are well established for AR models directly to MDLMs. This raises an immediate question: is such a naive transfer truly optimal? Two observations suggest it is not. 1) Block-wise and semi-AR decoding strategies are not used when training MDLMs, so why do they outperform full diffusion-style decoding at inference time? 2) Applying RL algorithms designed for AR models directly to MDLMs creates a training-inference inconsistency, because MDLM decoding is non-causal (parallel); as a result, the rollout trajectory and the optimization trajectory do not match. To address these challenges, we propose EOS Early Rejection (EOSER) and the Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding and achieve competitive performance with fewer decoding steps. In addition, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes consistency between the rollout trajectory and the optimization trajectory and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, including mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for taming MDLMs effectively and efficiently.

Code: https://github.com/yjyddq/EOSER-ASS-RL
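
To make the idea of an ascending step-size schedule concrete, the sketch below shows one possible way to unmask few tokens in early decoding steps and more in later ones. The abstract does not specify the actual ASS schedule, so the geometric growth rule, the function name `ascending_step_sizes`, and the `growth` parameter are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def ascending_step_sizes(num_tokens: int, growth: int = 2) -> List[int]:
    """Hypothetical ascending schedule (an assumption, not the paper's rule).

    Step t unmasks roughly growth**t tokens, so early steps commit to few
    tokens and later steps commit to many. The sizes always sum to num_tokens.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if growth < 2:
        raise ValueError("growth must be at least 2")

    sizes: List[int] = []
    remaining = num_tokens
    size = 1
    while remaining > 0:
        take = min(size, remaining)  # never unmask more than what is left
        sizes.append(take)
        remaining -= take
        size *= growth

    # Fold a short leftover tail into the previous step so that the
    # schedule stays non-decreasing.
    if len(sizes) >= 2 and sizes[-1] < sizes[-2]:
        tail = sizes.pop()
        sizes[-1] += tail
    return sizes


if __name__ == "__main__":
    print(ascending_step_sizes(256))
```

Under this assumed doubling rule, a 256-token response is decoded in 8 steps, whereas a fixed schedule of two tokens per step would need 128 steps. The real scheduler may use a different growth rule.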
