LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models
March 2, 2026
Authors: Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector-field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, directly optimizing denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation and yields precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability-flow path and enabling high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
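The abstract describes the core update as a contrastive, advantage-weighted adjustment of the denoiser's per-position logits, sidestepping sequence-likelihood (ELBO) estimation entirely. The paper's exact objective is not reproduced here; the following is a minimal sketch under stated assumptions, where the function name contrastive_velocity_loss, the tensor shapes, and the reward-centered weighting are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of a likelihood-free contrastive logit update for a masked
# diffusion denoiser. All names and the weighting scheme are assumptions for
# illustration; they are not taken from the LFPO paper.
import torch
import torch.nn.functional as F


def contrastive_velocity_loss(logits, targets, advantages, mask):
    """Advantage-weighted contrastive update on denoising logits.

    logits:     (B, T, V) denoiser outputs over the vocabulary
    targets:    (B, T)    tokens decoded in each sampled completion
    advantages: (B,)      reward-centered advantages from a verifiable reward
    mask:       (B, T)    1.0 where the position was masked (i.e. predicted)
    """
    # Per-position log-probabilities of the tokens that were actually decoded;
    # no sequence-level likelihood or ELBO approximation is required.
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Contrastive direction: positive advantage pushes logits toward the
    # decoded token, negative advantage pushes them away.
    weighted = advantages.unsqueeze(-1) * tok_logp * mask
    return -weighted.sum() / mask.sum().clamp(min=1.0)


if __name__ == "__main__":
    # Toy usage with random tensors, just to show the expected shapes.
    B, T, V = 2, 8, 50
    logits = torch.randn(B, T, V, requires_grad=True)
    targets = torch.randint(0, V, (B, T))
    advantages = torch.tensor([1.0, -1.0])  # e.g. centered within a sample group
    mask = torch.ones(B, T)
    loss = contrastive_velocity_loss(logits, targets, advantages, mask)
    loss.backward()
```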