Rethinking the Trust Region in LLM Reinforcement Learning
February 4, 2026
Authors: Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee
cs.AI
Abstract
Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio-clipping mechanism in PPO is structurally ill-suited to the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates sub-optimal learning dynamics: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which replaces heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a prohibitive memory footprint, we introduce efficient Binary and Top-K approximations that capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
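To make the contrast in the abstract concrete, below is a minimal PyTorch-style sketch. It is not the paper's implementation: the function names, the hyperparameters (`epsilon`, `k`, `delta`, `beta`), and the penalty-based formulation of the divergence constraint are all illustrative assumptions. The first function is the standard PPO clipped surrogate, where the sampled token's probability ratio acts as a one-sample estimate of the policy shift; the second shows one plausible way a Top-K Total Variation estimate over the next-token distribution could replace that per-ratio heuristic.

```python
# Illustrative sketch only; names and hyperparameters are assumptions, not the paper's DPPO.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Standard PPO clipped surrogate: the sampled token's ratio alone is a
    noisy single-sample Monte Carlo estimate of the policy divergence."""
    ratio = torch.exp(logp_new - logp_old)                     # [batch, seq]
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()

def topk_tv_divergence(logits_new, logits_old, k=64):
    """Top-K approximation of the Total Variation distance between the new and
    old next-token distributions: only the K most probable tokens under the old
    policy are compared, keeping the memory overhead small."""
    probs_new = torch.softmax(logits_new, dim=-1)              # [batch, seq, vocab]
    probs_old = torch.softmax(logits_old, dim=-1)
    topk_old, idx = probs_old.topk(k, dim=-1)                  # [batch, seq, k]
    topk_new = probs_new.gather(-1, idx)
    # TV restricted to the top-K support; a lower bound on the true TV distance.
    return 0.5 * (topk_new - topk_old).abs().sum(dim=-1)       # [batch, seq]

def divergence_penalized_loss(logits_new, logits_old, actions, advantages,
                              delta=0.05, beta=10.0):
    """One hedged reading of a divergence-constrained objective: keep the policy
    gradient term, and penalize tokens whose estimated divergence exceeds a
    trust-region radius `delta`, instead of clipping each sampled ratio."""
    logp_new = torch.log_softmax(logits_new, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    logp_old = torch.log_softmax(logits_old, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    ratio = torch.exp(logp_new - logp_old.detach())
    tv = topk_tv_divergence(logits_new, logits_old.detach())
    penalty = beta * torch.relu(tv - delta)                    # active only outside the trust region
    return -(ratio * advantages - penalty).mean()
```

The intended point of the sketch is the abstract's argument: the clipped ratio reacts to a single sampled token, whereas a divergence estimate computed over (an approximation of) the full next-token distribution can constrain large shifts in high-probability tokens without over-penalizing updates to low-probability ones.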