

Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

October 13, 2025
Authors: Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
cs.AI

Abstract

A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation in each training step. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, the forward computational graphs of all MC samples need to be retained for the gradient computation of non-linear terms in the RL objective, resulting in significant memory overhead. This constraint restricts feasible sample sizes, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated in a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: Both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, resulting in more accurate likelihood approximations and improved RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.
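The constant-memory claim rests on the linearity property: because the surrogate objective is a plain sum in which each term depends on only one Monte Carlo sample, each term can be back-propagated on its own and its computation graph freed before the next sample is processed. Below is a minimal PyTorch-style sketch of that gradient-accumulation pattern, not the authors' implementation; `per_sample_elbo_term`, `advantage`, and `num_mc_samples` are hypothetical names introduced only for illustration.

```python
import torch


def linear_surrogate_update(policy, optimizer, prompt, completion, advantage,
                            num_mc_samples=64):
    """Accumulate gradients one MC sample at a time.

    Because the surrogate objective is a linear sum over MC samples, each
    term can be back-propagated independently and its graph released
    immediately, so peak memory does not grow with `num_mc_samples`.
    """
    optimizer.zero_grad()
    for _ in range(num_mc_samples):
        # One Monte Carlo estimate of the ELBO term for this completion
        # (e.g., one sampled masking pattern of the diffusion LLM).
        # `per_sample_elbo_term` is a hypothetical method, not a real API.
        elbo_term = policy.per_sample_elbo_term(prompt, completion)

        # Linear surrogate: each term is weighted by the (constant) advantage
        # and averaged over samples; no non-linear function couples samples.
        loss = -(advantage * elbo_term) / num_mc_samples

        # Backward on this term alone; its computation graph is freed here,
        # keeping memory usage constant in the number of MC samples.
        loss.backward()

    optimizer.step()
```

In contrast, an objective with a non-linear function of the full MC estimate (e.g., a clipped importance ratio over the summed log-likelihood) would require retaining all per-sample graphs before a single backward pass, which is the memory bottleneck the paper sets out to remove.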