Trust Region Q-Adjoint Matching

Abstract

Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when the critic is ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from pretrained flow policies through projected dual descent. Specifically, we optimize the trust-region parameter λ in the SOC dynamics and show theoretically that the path-space KL can be expressed as a closed-form function of λ. As a result, our method can precisely control the deviation from pretrained flow policies, yielding stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the strongest baseline at 46%.
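
To give some intuition for adapting a trust-region multiplier by projected dual descent, the sketch below uses a deliberately simplified discrete-action analogue. It is not the TRQAM algorithm: it does not use the path-space KL or the closed-form expression in λ from the paper. It assumes a KL-constrained improvement step whose solution is the tilted policy π_λ(a) ∝ π₀(a) exp(Q(a)/λ), for which the dual gradient with respect to λ is ε − KL(π_λ ‖ π₀). All function names, the toy Q-values, the step size, and the budget ε are hypothetical.

```python
import numpy as np


def tilted_policy(prior, q_values, lam):
    """Return pi_lam(a) proportional to prior(a) * exp(q(a) / lam)."""
    logits = np.log(prior) + q_values / lam
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def kl_divergence(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def fit_multiplier(prior, q_values, epsilon, lam=1.0, step=1.0,
                   iters=500, lam_min=1e-3):
    """Projected dual descent on the multiplier lam (toy setting, assumed).

    The dual gradient is (epsilon - KL), so lam grows when the tilted
    policy exceeds the KL budget and shrinks when it stays inside it.
    """
    for _ in range(iters):
        kl = kl_divergence(tilted_policy(prior, q_values, lam), prior)
        # Gradient step on the dual, then projection onto lam >= lam_min.
        lam = max(lam_min, lam - step * (epsilon - kl))
    return lam


if __name__ == "__main__":
    prior = np.full(4, 0.25)                 # stand-in for a pretrained policy
    q = np.array([1.0, 0.5, 0.0, -0.5])      # hypothetical critic values
    lam = fit_multiplier(prior, q, epsilon=0.1)
    kl = kl_divergence(tilted_policy(prior, q, lam), prior)
    print(f"lambda = {lam:.4f}, KL = {kl:.4f}")
```

In this toy setting the multiplier settles where the KL between the tilted policy and the prior matches the budget ε, which is the behaviour a trust-region constraint is meant to enforce. The fixed step size is chosen by hand for this small example and would need tuning elsewhere.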