LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains that require correctness, such as mathematical reasoning and code generation. However, directly applying this paradigm to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation and yields precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, which straightens the probability flow and enables high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
