Trust Region Q-Adjoint Matching

Abstract

Fine-tuning pretrained flow policies with off-policy reinforcement learning remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: when the critic is ill-conditioned, small critic errors are amplified, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter λ in the SOC dynamics and show theoretically that the path-space KL can be expressed as a closed-form function of λ. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the 46% of the strongest baseline.
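
To make the role of λ concrete, the following is a minimal sketch of projected dual descent on a one-dimensional toy problem. It is not the TRQAM implementation. The function names (`toy_kl`, `projected_dual_descent`), the KL budget, the step size, and the lower bound `lam_min` are all hypothetical. The toy assumes a Gaussian reference N(0, σ²) tilted by a linear critic with slope g, for which the KL to the reference is σ²g²/(2λ²); this stands in for, and should not be read as, the closed-form path-space KL referred to in the abstract.

```python
def toy_kl(lam, grad=2.0, sigma=1.0):
    """KL between the tilted Gaussian N(sigma^2 * grad / lam, sigma^2)
    and the reference N(0, sigma^2). Toy assumption, not the paper's formula."""
    return (sigma * grad) ** 2 / (2.0 * lam ** 2)


def projected_dual_descent(kl_fn, budget, lam=1.0, lr=0.5, lam_min=1e-3, steps=2000):
    """Adapt lam so that kl_fn(lam) approaches the KL budget.

    The dual gradient with respect to lam is (budget - KL), so a descent
    step raises lam when the KL exceeds the budget and lowers it otherwise.
    The max(...) is the projection onto the feasible set lam >= lam_min.
    """
    for _ in range(steps):
        violation = kl_fn(lam) - budget
        lam = max(lam_min, lam + lr * violation)
    return lam


# Hypothetical budget: keep the KL to the reference at 0.5 nats.
lam_star = projected_dual_descent(toy_kl, budget=0.5)
print(lam_star, toy_kl(lam_star))
```

In this toy, a larger λ means a tighter trust region (smaller KL), so the update pushes λ up whenever the current deviation exceeds the budget.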