Trust Region Q-Adjoint Matching

Abstract

Fine-tuning pretrained flow policies with off-policy reinforcement learning remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: when the critic is ill-conditioned, small critic errors are amplified, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter λ in the SOC dynamics and show theoretically that the path-space KL divergence can be expressed as a closed-form function of λ. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the strongest baseline (46%).
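
To make the idea of projected dual descent on a trust-region parameter concrete, the sketch below works through a one-dimensional toy analogue. It is not the TRQAM algorithm and does not use the paper's closed-form path-space KL, which the abstract does not state. The assumptions are: a standard Gaussian base distribution N(0, 1), a linear reward r(x) = a·x, and a tilted distribution proportional to exp(r(x)/λ), which is N(a/λ, 1) with KL divergence a²/(2λ²) from the base. The names `toy_kl` and `projected_dual_descent` and all hyperparameters are hypothetical.

```python
def toy_kl(lam: float, a: float) -> float:
    """KL(N(a/lam, 1) || N(0, 1)): closed form for the toy tilted Gaussian."""
    return a * a / (2.0 * lam * lam)


def projected_dual_descent(a: float, eps: float, lam0: float = 2.0,
                           lr: float = 0.5, lam_min: float = 1e-3,
                           steps: int = 200) -> float:
    """Minimize the toy dual g(lam) = lam * eps + a^2 / (2 * lam) over lam >= lam_min.

    The derivative of g is eps - KL(lam), so each step lowers lam when the KL
    is under the budget eps and raises it when the KL exceeds the budget.
    """
    lam = lam0
    for _ in range(steps):
        grad = eps - toy_kl(lam, a)          # dual gradient (toy assumption)
        lam = max(lam_min, lam - lr * grad)  # gradient step, then projection
    return lam


# Hypothetical settings: reward slope a = 1.0, KL budget eps = 0.5.
lam_star = projected_dual_descent(a=1.0, eps=0.5)
kl_at_star = toy_kl(lam_star, a=1.0)
```

In this toy setting the iteration settles at the λ for which the KL equals the budget, which is the sense in which a closed-form KL in λ allows the deviation from the base distribution to be set directly.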