

FlowRL: Matching Reward Distributions for LLM Reasoning

September 18, 2025
Authors: Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin
cs.AI

Abstract

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
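As a rough illustration of the idea, the sketch below (not the authors' released implementation) shows one common way a flow-balanced objective of this kind can be written in PyTorch: the target distribution is assumed to be proportional to exp(β·r(x, y)), the learnable partition function enters as a log Z estimate, and flow balance is enforced as a squared residual over sampled trajectories. The names `flow_balance_loss`, `log_z`, and `beta` are illustrative assumptions, and details such as length normalization or a reference-policy term are omitted.

```python
import torch

def flow_balance_loss(logp_policy, reward, log_z, beta=1.0):
    """Squared flow-balance residual for a batch of sampled trajectories.

    Matching the policy to a target distribution proportional to
    exp(beta * r(x, y)) can be expressed as driving
        log Z(x) + log pi_theta(y|x) - beta * r(x, y)
    toward zero, where log Z(x) is a learnable estimate of the
    log-partition function. This is a hypothetical sketch of that idea,
    not the paper's exact objective.

    Args:
        logp_policy: (B,) sum of policy log-probs over each trajectory.
        reward:      (B,) scalar reward per trajectory.
        log_z:       (B,) learnable log-partition estimates.
        beta:        temperature scaling the reward in the target distribution.
    """
    residual = log_z + logp_policy - beta * reward
    return (residual ** 2).mean()


# Toy usage: one learnable scalar log Z shared across 4 sampled trajectories.
log_z = torch.zeros(1, requires_grad=True)
logp_policy = torch.tensor([-12.3, -15.1, -9.8, -20.4], requires_grad=True)
reward = torch.tensor([1.0, 0.0, 1.0, 0.0])

loss = flow_balance_loss(logp_policy, reward, log_z.expand(4))
loss.backward()  # gradients flow to both the policy log-probs and log Z
```

Driving this residual to zero on trajectories sampled from the policy pushes log π_θ(y|x) toward β·r(x, y) − log Z(x), i.e., toward the normalized target distribution described in the abstract, rather than toward the single highest-reward mode.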