Trust Region Q-Adjoint Matching

Abstract

Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when the critic is ill-conditioned, which often leads to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter λ in the SOC dynamics and show theoretically that the path-space KL divergence can be expressed as a closed-form function of λ. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms the prior state of the art in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the 46% of the strongest baseline.
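
To make the idea of projected dual descent on λ concrete, the following is a minimal sketch on a toy problem, not the TRQAM algorithm itself. Everything in it is an illustrative assumption: the base policy is a one-dimensional Gaussian N(0, 1), the critic is linear, Q(a) = g * a, and the KL-regularized optimum is then N(g / λ, 1), so that the KL divergence to the base policy is (g / λ)^2 / 2. This stands in for the closed-form path-space KL mentioned above, whose actual expression is not given in this abstract. The names `kl_to_base`, `projected_dual_descent`, `eps`, `lr`, and `lam_min` are hypothetical.

```python
def kl_to_base(lam, g):
    """Toy closed-form KL(pi_lam || pi_0).

    Assumption (not from the paper): pi_0 = N(0, 1) and Q(a) = g * a,
    so the KL-regularized optimum is pi_lam = N(g / lam, 1).
    """
    return 0.5 * (g / lam) ** 2


def projected_dual_descent(g, eps, lam_init=1.0, lr=0.5, lam_min=1e-3, steps=200):
    """Adapt lam so that the toy KL matches the trust-region radius eps.

    The dual of "maximize E[Q] subject to KL <= eps" has gradient
    (eps - KL(lam)) with respect to lam. Each iteration takes a gradient
    descent step on the dual and projects lam back onto [lam_min, inf).
    """
    lam = lam_init
    for _ in range(steps):
        grad = eps - kl_to_base(lam, g)   # dual gradient at the current lam
        lam = max(lam_min, lam - lr * grad)  # descent step, then projection
    return lam


# Hypothetical usage: critic slope g = 2 and KL budget eps = 0.5.
lam_star = projected_dual_descent(g=2.0, eps=0.5)
print(lam_star, kl_to_base(lam_star, 2.0))
```

In this toy setting the fixed point is available analytically as λ = |g| / sqrt(2 * eps), so the loop can be checked against it. When the KL exceeds the budget, the update increases λ and pulls the policy back toward the base policy; when the KL is below the budget, λ decreases and the policy is allowed to move further.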