Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
October 13, 2025
Authors: Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
cs.AI
Abstract
A key challenge in applying reinforcement learning (RL) to diffusion large
language models (dLLMs) lies in the intractability of their likelihood
functions, which are essential for the RL objective and must therefore be
approximated at each training step. While existing methods
approximate the log-likelihoods by their evidence lower bounds (ELBOs) via
customized Monte Carlo (MC) sampling, the forward computational graphs of all
MC samples need to be retained for the gradient computation of non-linear terms
in the RL objective, resulting in significant memory overhead. This constraint
restricts feasible sample sizes, leading to imprecise likelihood approximations
and ultimately distorting the RL objective. To overcome this limitation, we
propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient
RL algorithm that maximizes a specially constructed lower bound of the
ELBO-based objective. This lower bound is carefully designed to satisfy two key
properties: (1) Linearity: it is formulated as a linear sum where each term
depends only on a single MC sample, thereby enabling gradient accumulation
across samples and ensuring constant memory usage; (2) Equivalence: both the
value and gradient of this lower bound are equal to those of the ELBO-based
objective in on-policy training, making it also an effective approximation for
the original RL objective. These properties allow BGPO to adopt a large MC
sample size, resulting in more accurate likelihood approximations and improved
RL objective estimation, which in turn leads to enhanced performance.
Experiments show that BGPO significantly outperforms previous RL algorithms for
dLLMs in math problem solving, code generation, and planning tasks.
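
The two properties named above can be illustrated with a toy scalar example. The sketch below is not the BGPO bound itself, which the abstract does not spell out. It assumes a hypothetical per-sample term x_i(theta) = a_i * (theta - theta_old) that vanishes at the on-policy point theta = theta_old, and uses the generic tangent-line inequality exp(x) >= 1 + x as a stand-in lower bound. All function names and numbers are illustrative assumptions.

```python
import math

def x_i(theta, a, theta_old):
    """Toy per-sample term (assumption): zero at the on-policy point theta == theta_old."""
    return a * (theta - theta_old)

def nonlinear_objective(theta, coeffs, theta_old):
    """exp of the MC average: every sample enters through one non-linear function."""
    mean_x = sum(x_i(theta, a, theta_old) for a in coeffs) / len(coeffs)
    return math.exp(mean_x)

def linear_bound_accumulated(theta, coeffs, theta_old):
    """Tangent-line bound 1 + mean(x_i) <= exp(mean(x_i)).

    The bound is a plain sum over samples, so its value and gradient can be
    accumulated one sample at a time without holding the other samples.
    """
    n = len(coeffs)
    value, grad = 1.0, 0.0
    for a in coeffs:
        value += x_i(theta, a, theta_old) / n
        grad += a / n  # d x_i / d theta = a for this toy term
    return value, grad

def numeric_grad(f, theta, eps=1e-6):
    """Central finite difference, used only to check the non-linear gradient."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

coeffs = [0.5, -1.2, 2.0, 0.3]  # hypothetical per-sample slopes
theta_old = 0.7

# At the on-policy point, the bound matches the objective in value and gradient.
bound_value, bound_grad = linear_bound_accumulated(theta_old, coeffs, theta_old)
full_value = nonlinear_objective(theta_old, coeffs, theta_old)
full_grad = numeric_grad(lambda t: nonlinear_objective(t, coeffs, theta_old), theta_old)

# Away from the on-policy point, the bound stays below the objective.
off_value, _ = linear_bound_accumulated(1.0, coeffs, theta_old)
off_full = nonlinear_objective(1.0, coeffs, theta_old)
```

In this toy setting the accumulated bound and the non-linear objective agree in value and gradient at theta_old, and the bound stays below the objective elsewhere. That is the same pattern the abstract describes for on-policy training, shown here only for a scalar stand-in.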