Trust Region Q Adjoint Matching
May 26, 2026
Authors: Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin
cs.AI
Abstract
Fine-tuning pretrained flow policies with off-policy reinforcement learning remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when the critic is ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter λ in the SOC dynamics and show theoretically that the path-space KL divergence can be expressed as a closed-form function of λ. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the strongest baseline (46%).
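To make the idea of tuning λ by projected dual descent concrete, the following is a minimal toy sketch, not the method from the paper. It assumes a one-dimensional base policy N(0, 1) and a linear critic Q(a) = c·a, for which the KL-regularized optimum is N(c/λ, 1) and the KL divergence has the closed form c²/(2λ²). This toy closed form, the function names, and all hyperparameters are illustrative assumptions; the path-space expression derived in the paper is not reproduced here.

```python
def kl_closed_form(lam: float, c: float) -> float:
    """KL(pi_lam || pi_0) in the toy setting (an assumption, not the paper's formula).

    pi_0 = N(0, 1) and pi_lam is proportional to pi_0(a) * exp(c * a / lam),
    which equals N(c / lam, 1), so the KL is (c / lam)^2 / 2.
    """
    return 0.5 * (c / lam) ** 2


def projected_dual_descent(
    c: float,
    eps: float,
    lam: float = 1.0,
    lr: float = 0.5,
    lam_min: float = 1e-3,
    steps: int = 200,
) -> float:
    """Adjust lam until the closed-form KL matches the trust-region budget eps."""
    for _ in range(steps):
        # The gradient of the dual objective w.r.t. lam is (eps - KL), so a
        # descent step raises lam when the KL exceeds the budget and lowers
        # it when the policy stays well inside the trust region.
        violation = kl_closed_form(lam, c) - eps
        # Projection step: keep lam strictly positive.
        lam = max(lam_min, lam + lr * violation)
    return lam


if __name__ == "__main__":
    lam_star = projected_dual_descent(c=2.0, eps=0.5)
    print(lam_star, kl_closed_form(lam_star, 2.0))
```

In this toy setting the constraint is tight at λ = |c| / sqrt(2·eps), so the loop settles where the KL equals the budget. A larger budget yields a smaller λ and therefore a larger deviation from the base policy.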