Reinforcement Learning via Value Gradient Flow

Abstract

We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods rely either on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, which enables adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at https://ryanxhr.github.io/vgf.
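
To make the idea of value gradients guiding particles under a transport budget concrete, here is a minimal toy sketch. It is not the authors' implementation: the one-dimensional critic `q_value`, its gradient `q_grad`, the Gaussian stand-in for the reference distribution, and the reading of "transport budget" as number of steps times step size are all assumptions made for illustration, and the actual VGF update may differ.

```python
import random


def q_value(a):
    """Hypothetical toy critic over a 1-D action; it peaks at a = 2.0."""
    return -(a - 2.0) ** 2


def q_grad(a):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (a - 2.0)


def value_gradient_flow(particles, num_steps, step_size):
    """Move each particle along the value gradient for a fixed number of steps.

    Assumption: the transport budget is approximated by num_steps * step_size,
    so fewer or smaller steps keep particles closer to the reference samples.
    """
    moved = list(particles)
    for _ in range(num_steps):
        moved = [a + step_size * q_grad(a) for a in moved]
    return moved


def mean(xs):
    return sum(xs) / len(xs)


rng = random.Random(0)
# Stand-in reference distribution N(0, 0.5^2); in offline RL these would be
# dataset actions, in LLM finetuning samples from the base model.
reference = [rng.gauss(0.0, 0.5) for _ in range(256)]

for num_steps in (0, 1, 5, 20):
    moved = value_gradient_flow(reference, num_steps, step_size=0.05)
    avg_value = mean([q_value(a) for a in moved])
    avg_shift = mean([abs(b - a) for a, b in zip(reference, moved)])
    print(f"steps={num_steps:2d}  mean value={avg_value:.3f}  mean shift={avg_shift:.3f}")
```

In this toy setting, increasing `num_steps` raises the mean value while moving particles further from their reference samples, which mirrors the trade-off the abstract describes between value improvement and regularization. With zero steps the particles are exactly the reference samples.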
