Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Abstract

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. Doing so often requires specialized training objectives or backpropagation through the denoising process, both of which cause well-known stability issues and limit scalability. In this paper, we ask whether simple policy improvement schemes applied at test time alone, leaving stable supervised policy training intact, can be a competitive alternative that sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF pre-trains both a reference flow policy (via a standard behavioral cloning objective) and a value function critic. At test time, it uses the value gradient to guide the reference policy toward higher-value actions, without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, by avoiding the instability of actor-critic training, it scales favorably with model size, offering a practical and effective alternative for RL with expressive policies.
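
To make the test-time mechanism concrete, the following is a minimal toy sketch of value-gradient guidance during flow sampling. It is not the QGF implementation: the abstract does not give the exact guidance rule, so the update used here (adding a scaled action-gradient of Q to the reference velocity at every Euler step) is an assumption. The functions `reference_velocity`, `q_value`, `q_grad`, and `sample_action`, and the parameter `guidance_scale`, are hypothetical names. The velocity field and critic are hand-written stand-ins for what would be trained networks.

```python
import math
import random


def reference_velocity(state, action, t):
    """Hypothetical stand-in for a BC-trained flow policy's velocity field.

    It moves any point along a straight path toward a state-dependent
    "behavior" action; a real policy would be a learned network.
    """
    target = [math.tanh(s) for s in state]
    return [(x1 - a) / max(1.0 - t, 1e-3) for x1, a in zip(target, action)]


def q_value(state, action):
    """Hypothetical critic: a quadratic peaked 0.5 away from the behavior action."""
    preferred = [math.tanh(s) + 0.5 for s in state]
    return -0.5 * sum((a - p) ** 2 for a, p in zip(action, preferred))


def q_grad(state, action):
    """Analytic gradient of q_value with respect to the action."""
    preferred = [math.tanh(s) + 0.5 for s in state]
    return [p - a for a, p in zip(action, preferred)]


def sample_action(state, guidance_scale, num_steps=10, seed=0):
    """Euler-integrate the flow from noise, nudged by the critic's gradient.

    Assumed guidance rule: velocity + guidance_scale * dQ/da at each step.
    With guidance_scale = 0 this reduces to sampling the reference policy.
    """
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in state]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = reference_velocity(state, action, t)
        g = q_grad(state, action)
        action = [a + dt * (vi + guidance_scale * gi)
                  for a, vi, gi in zip(action, v, g)]
    return action


state = [0.5, -0.3]
plain = sample_action(state, guidance_scale=0.0)
guided = sample_action(state, guidance_scale=1.0)
print(q_value(state, plain), q_value(state, guided))
```

In this toy setting, neither the velocity field nor the critic is updated during sampling. Only the integration path changes, which is the sense in which policy improvement happens at test time. Setting `guidance_scale` to zero recovers the reference policy, so the parameter trades off staying close to the behavior policy against following the critic.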