GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies
December 2, 2025
Authors: Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An
cs.AI
Abstract
Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.
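To make the decoupling concrete, below is a minimal PyTorch sketch of the idea the abstract describes: optimize a tractable Gaussian policy in a latent space, and synthesize actions with a conditional decoder whose likelihood is never evaluated. Everything here is an assumption for illustration, not the paper's implementation: the plain MLP ConditionalDecoder stands in for a diffusion or flow-matching sampler, the advantage signal and decoder targets are placeholders, and the two-timescale schedule is rendered as a smaller learning rate plus a lower update frequency (DECODER_UPDATE_EVERY). All names and hyperparameters are hypothetical.

```python
# Hedged sketch: latent-policy optimization decoupled from action generation.
# Architectural choices and objectives are assumptions, not from the paper.
import torch
import torch.nn as nn

class LatentGaussianPolicy(nn.Module):
    """Tractable latent policy: a diagonal Gaussian over a latent code z."""
    def __init__(self, state_dim, latent_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_std = nn.Parameter(torch.zeros(latent_dim))

    def dist(self, state):
        return torch.distributions.Normal(self.mu(self.body(state)),
                                          self.log_std.exp())

class ConditionalDecoder(nn.Module):
    """Maps (state, z) -> action. A plain MLP stands in for the diffusion /
    flow-matching sampler; its action likelihood is never needed."""
    def __init__(self, state_dim, latent_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

state_dim, latent_dim, action_dim, batch = 8, 4, 2, 32
policy = LatentGaussianPolicy(state_dim, latent_dim)
decoder = ConditionalDecoder(state_dim, latent_dim, action_dim)

# Two timescales, rendered here as a smaller learning rate plus a lower
# update frequency for the decoder (both values are hypothetical).
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
decoder_opt = torch.optim.Adam(decoder.parameters(), lr=3e-5)
DECODER_UPDATE_EVERY = 10

for step in range(100):
    state = torch.randn(batch, state_dim)      # placeholder rollout batch
    dist = policy.dist(state)
    z = dist.sample()
    log_prob = dist.log_prob(z).sum(-1)        # tractable likelihood in z
    advantage = torch.randn(batch)             # placeholder; use a critic

    # Fast timescale: policy-gradient surrogate on the latent code only, so
    # no gradient ever propagates through the decoder's sampling chain.
    policy_loss = -(log_prob * advantage).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Slow timescale: advantage-weighted regression toward executed actions,
    # standing in for whatever generative objective trains the real decoder.
    if step % DECODER_UPDATE_EVERY == 0:
        executed = torch.rand(batch, action_dim) * 2 - 1  # placeholder buffer
        w = torch.softmax(advantage, dim=0).unsqueeze(-1)
        dec_loss = (w * (decoder(state, z) - executed) ** 2).mean()
        decoder_opt.zero_grad()
        dec_loss.backward()
        decoder_opt.step()
```

The property this sketch preserves is the one the abstract argues for: the RL objective only ever differentiates a Gaussian log-likelihood in latent space, while the expressive decoder is trained by a separate, slower objective and never needs a tractable action likelihood.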