

FlowRL: Matching Reward Distributions for LLM Reasoning

September 18, 2025
Authors: Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin
cs.AI

Abstract

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
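As a rough illustration of the idea, the sketch below (not the authors' released implementation) shows one common way a flow-balanced objective of this kind can be written in PyTorch: the target distribution is assumed to be proportional to exp(β·r(x, y)), the learnable partition function enters as a log Z estimate, and flow balance is enforced as a squared residual over sampled trajectories. The names `flow_balance_loss`, `log_z`, and `beta` are illustrative assumptions, and details such as length normalization or a reference-policy term are omitted.

```python
import torch

def flow_balance_loss(logp_policy, reward, log_z, beta=1.0):
    """Squared flow-balance residual for a batch of sampled trajectories.

    Matching the policy to a target distribution proportional to
    exp(beta * r(x, y)) can be expressed as driving
        log Z(x) + log pi_theta(y|x) - beta * r(x, y)
    toward zero, where log Z(x) is a learnable estimate of the
    log-partition function. This is a hypothetical sketch of that idea,
    not the paper's exact objective.

    Args:
        logp_policy: (B,) sum of policy log-probs over each trajectory.
        reward:      (B,) scalar reward per trajectory.
        log_z:       (B,) learnable log-partition estimates.
        beta:        temperature scaling the reward in the target distribution.
    """
    residual = log_z + logp_policy - beta * reward
    return (residual ** 2).mean()


# Toy usage: one learnable scalar log Z shared across 4 sampled trajectories.
log_z = torch.zeros(1, requires_grad=True)
logp_policy = torch.tensor([-12.3, -15.1, -9.8, -20.4], requires_grad=True)
reward = torch.tensor([1.0, 0.0, 1.0, 0.0])

loss = flow_balance_loss(logp_policy, reward, log_z.expand(4))
loss.backward()  # gradients flow to both the policy log-probs and log Z
```

Driving this residual to zero on trajectories sampled from the policy pushes log π_θ(y|x) toward β·r(x, y) − log Z(x), i.e., toward the normalized target distribution described in the abstract, rather than toward the single highest-reward mode.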