Test-Time Gradient Guidance for Flow Policies in Reinforcement Learning

Abstract

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. Doing so often requires specialized training objectives or backpropagation through the denoising process, which causes well-known stability issues and hurts scalability. In this paper, we study whether simple policy improvement schemes applied at test time alone, leaving stable supervised policy training intact, can be a competitive alternative that sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF pre-trains a reference flow policy (via a standard behavioral cloning objective) and a value function critic; at test time, it uses the value gradient to guide the reference policy toward higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, because it avoids the instability of actor-critic training, it scales favorably with model size, offering a practical and effective alternative RL algorithm for expressive policies.
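
The test-time procedure described above can be illustrated with a minimal toy sketch. The abstract does not state the exact guidance rule, so the code below assumes the simplest possible form: at every Euler step of the flow ODE, a scaled gradient of Q with respect to the action is added to the reference velocity. Everything here is hypothetical and chosen for illustration: the names `reference_velocity`, `q_value`, `q_grad`, `sample_action`, and `guidance_scale`, the closed-form Gaussian velocity field standing in for a trained behavioral cloning flow policy, and the quadratic critic. State conditioning is omitted for brevity.

```python
# Toy sketch of test-time value-gradient guidance for a flow policy.
# Assumption: guidance adds guidance_scale * dQ/da to the reference velocity.
# This is an illustrative rule, not necessarily the one used by QGF.


def reference_velocity(a, t, mu, s):
    """Stand-in for a pre-trained flow policy (hypothetical).

    Closed-form velocity that transports N(0, I) noise to N(mu, s^2 I)
    along linear interpolation paths with independent coupling.
    """
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    coef = (t * s * s - (1.0 - t)) / var_t
    return [m + coef * (x - t * m) for x, m in zip(a, mu)]


def q_value(a, target):
    """Stand-in critic (hypothetical): Q(a) = -||a - target||^2."""
    return -sum((x - g) ** 2 for x, g in zip(a, target))


def q_grad(a, target):
    """Analytic gradient of the toy critic with respect to the action."""
    return [-2.0 * (x - g) for x, g in zip(a, target)]


def sample_action(noise, mu, s, target, guidance_scale, num_steps=20):
    """Euler-integrate the flow from t=0 to t=1, nudged by the value gradient."""
    a = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = reference_velocity(a, t, mu, s)
        g = q_grad(a, target)
        a = [x + dt * (vi + guidance_scale * gi) for x, vi, gi in zip(a, v, g)]
    return a


# Same starting noise for both runs, so only the guidance term differs.
noise = [0.3, -0.7]
mu, s, target = [0.0, 0.0], 0.5, [1.0, 1.0]

unguided = sample_action(noise, mu, s, target, guidance_scale=0.0)
guided = sample_action(noise, mu, s, target, guidance_scale=1.0)

print("Q unguided:", q_value(unguided, target))
print("Q guided:  ", q_value(guided, target))
```

With `guidance_scale=0.0` the sampler reduces to plain sampling from the reference policy, which is why no policy retraining is involved in this sketch: the reference velocity field and the critic are both fixed, and only the sampling loop changes. In practice the critic gradient would come from automatic differentiation of a learned Q-network rather than an analytic formula.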