Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Abstract

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. They are known to scale stably in the supervised imitation learning setting, but incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. Doing so often requires specialized training objectives or backpropagation through the denoising process, both of which are known to destabilize training and limit scalability. In this paper, we ask whether simple policy improvement applied only at test time, leaving stable supervised policy training intact, can be a competitive alternative that sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF pre-trains a reference flow policy with a standard behavioral cloning objective, together with a value function critic. At test time, it uses the value gradient to guide the reference policy toward higher-value actions, without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and it is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, because it avoids the instability of actor-critic training, it scales favorably with model size, offering a practical and effective alternative RL algorithm for expressive policies.
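To make the test-time procedure concrete, the following is a minimal sketch of value-gradient guidance on a toy problem. It is not the authors' implementation. The abstract does not state the exact guidance rule, so the sketch assumes the simplest one: the critic gradient, scaled by a hypothetical `guidance_scale`, is added to the reference velocity at every Euler step of the flow ODE. The state input is omitted, the "pre-trained" flow policy is replaced by the closed-form velocity field of a Gaussian behavior policy, the critic is a hand-written quadratic, and all names and constants are illustrative.

```python
import random

# All names and constants here are hypothetical, chosen for illustration only.
SIGMA = 0.5                # std of the toy behavior policy N(0, SIGMA^2 I)
BEST_ACTION = (1.0, -0.5)  # maximizer of the toy critic
NUM_STEPS = 20             # Euler steps used to integrate the flow ODE


def reference_velocity(action, t):
    """Stand-in for a behavior-cloned flow network.

    This is the exact marginal velocity of the linear-interpolation flow
    a_t = (1 - t) * noise + t * a_1, with noise ~ N(0, I) and
    a_1 ~ N(0, SIGMA^2 I).
    """
    gain = (t * SIGMA**2 - (1.0 - t)) / ((1.0 - t) ** 2 + (t * SIGMA) ** 2)
    return [gain * a for a in action]


def q_value(action):
    """Stand-in for a pre-trained critic: a quadratic peaked at BEST_ACTION."""
    return -sum((a - b) ** 2 for a, b in zip(action, BEST_ACTION))


def q_gradient(action):
    """Analytic gradient of q_value with respect to the action."""
    return [-2.0 * (a - b) for a, b in zip(action, BEST_ACTION)]


def sample_action(noise, guidance_scale):
    """Integrate the flow from t = 0 to t = 1 with forward Euler steps.

    Assumed guidance rule: velocity + guidance_scale * dQ/da at each step.
    With guidance_scale = 0 this reduces to sampling the reference policy.
    """
    action = list(noise)
    dt = 1.0 / NUM_STEPS
    for k in range(NUM_STEPS):
        t = k * dt
        velocity = reference_velocity(action, t)
        grad = q_gradient(action)
        action = [
            a + dt * (v + guidance_scale * g)
            for a, v, g in zip(action, velocity, grad)
        ]
    return action


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in BEST_ACTION]
    for scale in (0.0, 0.5):
        action = sample_action(noise, scale)
        print(f"scale={scale}: action={action}, Q={q_value(action)}")
```

Neither the velocity field nor the critic is updated inside `sample_action`; only the sampling trajectory changes, which is the sense in which policy improvement happens entirely at test time. In a real system both functions would be neural networks conditioned on the state, and the critic gradient would come from automatic differentiation rather than a closed form.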