Reinforcement Learning via Value Gradient Flow

Abstract

We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods rely either on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, in which value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, which enables adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at https://ryanxhr.github.io/vgf.
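The particle view described above can be illustrated with a small toy sketch. This is not the authors' implementation: the quadratic value function, the Gaussian reference distribution, and every function name below are hypothetical stand-ins chosen for illustration. The sketch only shows the general idea that particles drawn from a reference distribution are moved by value gradients, and that the number of steps times the step size acts as a transport budget limiting how far they travel.

```python
import random

def value(a, target=(2.0, -1.0)):
    """Toy quadratic value Q(a), maximal at `target` (hypothetical stand-in for a learned critic)."""
    return -sum((x - t) ** 2 for x, t in zip(a, target))

def value_grad(a, target=(2.0, -1.0)):
    """Analytic gradient of the toy value with respect to the action."""
    return tuple(-2.0 * (x - t) for x, t in zip(a, target))

def sample_reference(n, rng):
    """Draw n particles from a toy reference distribution (standard 2-D Gaussian, an assumption)."""
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

def gradient_flow(particles, num_steps, step_size=0.05):
    """Move each particle along the value gradient for num_steps discrete steps.

    num_steps * step_size plays the role of the transport budget in this sketch:
    with zero steps the reference samples are returned unchanged.
    """
    for _ in range(num_steps):
        particles = [
            tuple(x + step_size * g for x, g in zip(a, value_grad(a)))
            for a in particles
        ]
    return particles

def mean_value(particles):
    """Average toy value over all particles."""
    return sum(value(a) for a in particles) / len(particles)

def mean_displacement(start, end):
    """Average Euclidean distance travelled by the particles."""
    return sum(
        sum((e - s) ** 2 for s, e in zip(a, b)) ** 0.5 for a, b in zip(start, end)
    ) / len(start)

rng = random.Random(0)
reference = sample_reference(256, rng)
for steps in (0, 5, 20):
    moved = gradient_flow(reference, steps)
    # Columns: number of steps, mean value, mean distance from the reference samples
    print(steps, round(mean_value(moved), 3), round(mean_displacement(reference, moved), 3))
```

In this toy setting, a larger step count raises the mean value and also moves the particles farther from their reference samples, which is the trade-off that adjusting the budget at test time would control.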
