Trust Region Q-Adjoint Matching

Abstract

Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: when the critic is ill-conditioned, small critic errors are amplified, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter λ in the SOC dynamics and show theoretically that the path-space KL can be expressed as a closed-form function of λ. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the strongest baseline at 46%.
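To give a feel for how projected dual descent on a trust-region parameter can enforce a KL budget, the following is a minimal toy sketch and not the TRQAM algorithm itself. It assumes a one-dimensional Gaussian reference N(0, σ²) and a linear reward a·x, for which the KL-regularized optimum is N(aσ²/λ, σ²) and the KL to the reference is a²σ²/(2λ²). That closed form belongs to this toy problem only; the function names, the target `eps`, the step size, and the lower bound `lam_min` are all hypothetical choices made for illustration.

```python
import math


def kl_of_lambda(lam, a=1.0, sigma=1.0):
    """Toy closed form (an assumption of this sketch, not the paper's formula).

    Reference p_ref = N(0, sigma^2), reward r(x) = a * x.
    The maximizer of E_p[r] - lam * KL(p || p_ref) is N(a * sigma^2 / lam, sigma^2),
    whose KL to the reference is (a * sigma)^2 / (2 * lam^2).
    """
    return (a * sigma) ** 2 / (2.0 * lam ** 2)


def projected_dual_descent(eps, lam0=1.0, lr=0.5, steps=2000, lam_min=1e-3):
    """Find lam such that KL(lam) is approximately eps.

    The dual objective g(lam) = max_p { E_p[r] - lam * (KL(p || p_ref) - eps) }
    has derivative eps - KL(lam) by the envelope theorem, so a descent step
    raises lam when the KL exceeds the budget and lowers it otherwise.
    """
    lam = lam0
    for _ in range(steps):
        grad = eps - kl_of_lambda(lam)
        # Gradient step followed by projection onto [lam_min, infinity).
        lam = max(lam_min, lam - lr * grad)
    return lam


if __name__ == "__main__":
    eps = 0.1
    lam = projected_dual_descent(eps)
    print(f"lambda = {lam:.6f}, KL = {kl_of_lambda(lam):.6f}, target = {eps}")
```

In this toy setting the fixed point can be checked by hand: setting a²σ²/(2λ²) = ε gives λ = aσ/√(2ε), so the iteration should settle where the KL matches the budget. The paper's setting replaces the Gaussian closed form with its own path-space KL expression for flow policies.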