Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Abstract

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. Doing so often requires specialized training objectives or backpropagation through the denoising process, both of which cause well-known stability issues and hurt scalability. In this paper, we study whether simple policy improvement schemes applied at test time alone, leaving stable supervised policy training intact, can be a competitive alternative that sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF pre-trains both a reference flow policy (via a standard behavioral cloning objective) and a value function critic. At test time, it uses the value gradient to guide the reference policy toward higher-value actions, without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, because it avoids the instability of actor-critic training, it scales favorably with model size, offering a practical and effective RL algorithm for expressive policies.
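
The test-time guidance idea can be illustrated with a minimal sketch. The abstract does not state the exact guidance rule, so the code below makes a loud assumption: the gradient of the critic with respect to the action, scaled by a hypothetical `guidance_scale`, is simply added to the reference velocity at every Euler step of flow sampling. The velocity field, the quadratic critic, and all function names are toy stand-ins invented for illustration, not the paper's networks or update rule, and state conditioning is omitted for brevity.

```python
import numpy as np


def reference_velocity(action, t, behavior_action):
    """Toy stand-in for a pretrained behavioral-cloning flow policy (hypothetical).

    It transports any sample toward `behavior_action` as t goes from 0 to 1.
    """
    return (behavior_action - action) / (1.0 - t)


def q_value(action, best_action):
    """Toy quadratic critic (hypothetical): highest value at `best_action`."""
    return -float(np.sum((action - best_action) ** 2))


def q_gradient(action, best_action):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (action - best_action)


def sample_action(noise, behavior_action, best_action,
                  guidance_scale=0.0, num_steps=10):
    """Euler-integrate the flow from noise, nudging each step along grad_a Q.

    Assumption: guidance is an additive term on the velocity. With
    guidance_scale=0.0 this reduces to plain sampling from the reference flow.
    """
    action = np.asarray(noise, dtype=float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        velocity = reference_velocity(action, t, behavior_action)
        velocity = velocity + guidance_scale * q_gradient(action, best_action)
        action = action + dt * velocity
    return action
```

Calling `sample_action` twice from the same noise, once with `guidance_scale=0.0` and once with a positive scale, lets one compare the critic value of the unguided and guided actions. No policy parameters are updated in either case; only the sampling procedure changes, which is the property the abstract emphasizes.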