LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models
March 2, 2026
Authors: Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector-field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, directly optimizing denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation and yields precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
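To make the "contrastive updates on denoising logits" idea concrete, the following is a minimal PyTorch-style sketch, not the authors' released method: it assumes a setting where per-token log-probabilities at masked positions are weighted by a verifiable-reward advantage, so no sequence likelihood is ever computed. The function name contrastive_logit_loss, the tensor shapes, and the advantage-weighting scheme are all illustrative assumptions.

# Hypothetical sketch of a likelihood-free, reward-weighted update on denoising logits.
# All names and shapes below are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def contrastive_logit_loss(logits, target_tokens, mask, advantage):
    # logits:        (batch, seq_len, vocab) denoising logits from the masked diffusion model
    # target_tokens: (batch, seq_len) tokens from sampled completions
    # mask:          (batch, seq_len) 1.0 where the position was masked (i.e., predicted)
    # advantage:     (batch,) verifiable-reward advantage per completion (e.g., reward minus baseline)
    log_probs = F.log_softmax(logits, dim=-1)                                   # per-position log-probabilities
    tok_logp = log_probs.gather(-1, target_tokens.unsqueeze(-1)).squeeze(-1)    # (batch, seq_len)
    # Push logits toward tokens of high-advantage completions and away from low-advantage ones.
    weighted = -advantage.unsqueeze(1) * tok_logp * mask
    return weighted.sum() / mask.sum().clamp(min=1)

# Toy usage with random tensors standing in for a model's outputs.
B, L, V = 2, 8, 100
logits = torch.randn(B, L, V, requires_grad=True)
targets = torch.randint(0, V, (B, L))
mask = (torch.rand(B, L) > 0.5).float()
advantage = torch.tensor([1.0, -1.0])
loss = contrastive_logit_loss(logits, targets, mask, advantage)
loss.backward()

Because the loss acts directly on the denoising logits at masked positions, gradients flow without any approximation of the sequence likelihood, which is the property the abstract attributes to LFPO; the specific weighting shown here is only one plausible instantiation.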