GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

December 2, 2025
Authors: Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An
cs.AI

Abstract

Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.
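To make the structural idea concrete, below is a minimal, hypothetical sketch of the decoupling the abstract describes: a tractable Gaussian latent policy optimized on a fast timescale, and a conditional decoder, here a plain MLP standing in for a diffusion or flow-matching sampler, refined on a slower timescale. All class names, dimensions, losses, and the update ratio are illustrative assumptions and not the authors' implementation.

```python
# Illustrative sketch only: NOT the GoRL code. The advantage signal, decoder
# target, dimensions, and update ratio are placeholders for exposition.
import torch
import torch.nn as nn

class LatentGaussianPolicy(nn.Module):
    """Tractable latent policy: a state-conditioned diagonal Gaussian over z."""
    def __init__(self, state_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )

    def forward(self, state):
        mu, log_std = self.net(state).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())
        z = dist.rsample()                      # reparameterized sample
        return z, dist.log_prob(z).sum(-1)      # tractable latent likelihood

class ConditionalDecoder(nn.Module):
    """Expressive decoder mapping (state, z) -> action; no action likelihood is
    ever required, so its internal sampling chain can be arbitrarily deep."""
    def __init__(self, state_dim, latent_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

state_dim, latent_dim, action_dim = 17, 8, 6
policy = LatentGaussianPolicy(state_dim, latent_dim)
decoder = ConditionalDecoder(state_dim, latent_dim, action_dim)
opt_policy = torch.optim.Adam(policy.parameters(), lr=3e-4)    # fast timescale
opt_decoder = torch.optim.Adam(decoder.parameters(), lr=3e-5)  # slow timescale
DECODER_UPDATE_EVERY = 10  # illustrative two-timescale ratio

state = torch.randn(32, state_dim)              # stand-in batch of states
for step in range(100):
    # Fast timescale: optimize the tractable latent policy with a surrogate
    # objective (a dummy advantage-weighted log-prob here, as a placeholder
    # for any off-the-shelf actor-critic loss defined on z).
    z, log_prob = policy(state)
    advantage = torch.randn(32)                 # placeholder for a critic's advantage
    policy_loss = -(advantage.detach() * log_prob).mean()
    opt_policy.zero_grad(); policy_loss.backward(); opt_policy.step()

    # Slow timescale: occasionally refine the decoder (placeholder regression
    # loss; a real generative objective such as flow matching would go here).
    if step % DECODER_UPDATE_EVERY == 0:
        with torch.no_grad():
            z_fixed, _ = policy(state)
        action = decoder(state, z_fixed)
        decoder_loss = (action - torch.zeros_like(action)).pow(2).mean()
        opt_decoder.zero_grad(); decoder_loss.backward(); opt_decoder.step()
```

The point of the sketch is that gradients for policy improvement flow only through the Gaussian latent head, while the expressive decoder is trained separately and more slowly, which is how the abstract's "optimization decoupled from generation" claim avoids backpropagating through deep sampling chains.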