Reinforcement Learning via Value Gradient Flow
April 15, 2026
Authors: Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, Amy Zhang
cs.AI
Abstract
We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods either rely on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, which enables adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at https://ryanxhr.github.io/vgf.
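The core loop described above can be illustrated with a minimal sketch: actions are initialized as particles drawn from the reference distribution and then moved along the gradient of a value function for a fixed number of steps, so the product of step size and step count acts as the transport budget that implicitly regularizes the result. This is not the paper's implementation; the function names (`value_gradient_flow`, `q_grad`) and the toy quadratic value function are illustrative assumptions.

```python
import numpy as np

def value_gradient_flow(q_grad, reference_samples, step_size=0.1, n_steps=10):
    """Discrete gradient flow sketch: particles start at samples from the
    reference distribution and ascend the value gradient. The transport
    budget (roughly step_size * n_steps) bounds how far particles can move
    from the reference, acting as implicit behavior regularization."""
    particles = np.array(reference_samples, dtype=float)
    for _ in range(n_steps):
        # Each particle takes a small step in the direction of higher value.
        particles = particles + step_size * q_grad(particles)
    return particles

# Toy example: value Q(a) = -(a - 1)^2, so the gradient is -2 * (a - 1)
# and the value-maximizing action is a = 1.
q_grad = lambda a: -2.0 * (a - 1.0)
reference_samples = np.zeros((5, 1))  # "behavior" actions concentrated at 0
actions = value_gradient_flow(q_grad, reference_samples, step_size=0.1, n_steps=20)
```

With a larger `n_steps` (a bigger transport budget) the particles move closer to the value optimum; shrinking the budget keeps them nearer the reference distribution, which mirrors the adaptive test-time scaling knob described in the abstract.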