Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Abstract

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult: doing so often requires specialized training objectives or backpropagation through the denoising process, both of which cause well-known stability issues and hurt scalability. In this paper, we ask whether simple policy improvement schemes applied only at test time, which leave stable supervised policy training intact, can be a competitive alternative that sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF pre-trains a reference flow policy (via a standard behavioral cloning objective) and a value function critic; at test time, it uses the critic's value gradient to guide the reference policy toward higher-value actions, without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it scales favorably with model size because it avoids the instability of actor-critic training, offering a practical and effective alternative for RL with expressive policies.
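To make the test-time procedure concrete, the following is a minimal one-dimensional sketch of value-gradient guidance during flow sampling. It is an illustration under stated assumptions rather than the QGF implementation: the closed-form velocity field stands in for a behavior-cloned flow network, the quadratic critic and its analytic gradient stand in for a learned Q-function, and the additive rule (reference velocity plus a scaled critic gradient) is one plausible way to apply the guidance. All names and constants (`reference_velocity`, `q_gradient`, `guidance_scale`, `BEST_ACTION`, and so on) are hypothetical.

```python
# Toy 1-D illustration of test-time value-gradient guidance of a flow policy.
# Everything here (names, closed-form field, quadratic critic, additive
# guidance rule) is an illustrative assumption, not the paper's implementation.

BEHAVIOR_MEAN = 0.0  # mean of the hypothetical behavior-cloned action distribution
BEHAVIOR_STD = 0.5   # its standard deviation
BEST_ACTION = 2.0    # action preferred by the hypothetical critic


def reference_velocity(action, t):
    """Closed-form flow velocity carrying N(0, 1) noise to N(mean, std^2)
    along linear interpolation paths; stands in for a trained flow network."""
    var_t = (1.0 - t) ** 2 + (t * BEHAVIOR_STD) ** 2
    rate = (t * BEHAVIOR_STD ** 2 - (1.0 - t)) / var_t
    return BEHAVIOR_MEAN + rate * (action - t * BEHAVIOR_MEAN)


def q_value(action):
    """Hypothetical critic: highest value at BEST_ACTION."""
    return -(action - BEST_ACTION) ** 2


def q_gradient(action):
    """Analytic gradient of q_value with respect to the action."""
    return -2.0 * (action - BEST_ACTION)


def sample_action(noise, guidance_scale, num_steps=50):
    """Euler-integrate the flow from t=0 to t=1, adding the scaled critic
    gradient to the reference velocity at every step."""
    action = noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        velocity = reference_velocity(action, t)
        velocity += guidance_scale * q_gradient(action)
        action += dt * velocity
    return action


noises = [-1.5, -0.5, 0.0, 0.5, 1.5]  # fixed starting noise for reproducibility
unguided = [sample_action(z, guidance_scale=0.0) for z in noises]
guided = [sample_action(z, guidance_scale=1.0) for z in noises]
for z, a_ref, a_new in zip(noises, unguided, guided):
    print(f"noise={z:+.1f}  reference={a_ref:+.3f}  guided={a_new:+.3f}")
```

Setting `guidance_scale=0.0` recovers plain sampling from the reference policy, so neither the policy nor the critic is modified at test time. In this toy setup, a positive scale moves each sample toward the action the critic prefers; how strongly to guide is a tuning choice that this sketch does not address.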