FlowRL: Matching Reward Distributions for LLM Reasoning

September 18, 2025
Authors: Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin
cs.AI

Abstract

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
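
As a rough sketch of the objective described in the abstract (the paper's exact formulation may differ; the temperature β and the squared trajectory-balance-style surrogate below are illustrative assumptions): a learnable partition function Z_φ(x) turns the scalar reward r(x, y) into a normalized target distribution, the policy π_θ is trained to match that target under reverse KL, and flow balancing can be realized as a squared log-ratio loss that vanishes exactly when the two distributions coincide.

% Reward-induced target distribution, with learnable partition function Z_phi(x);
% beta is an assumed temperature not stated in the abstract.
\[
  p(y \mid x) \;=\; \frac{\exp\big(\beta\, r(x, y)\big)}{Z_\phi(x)}
\]
% Policy objective: minimize the reverse KL between the policy and the target.
\[
  \min_{\theta}\; D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\;\big\|\;p(\cdot \mid x)\big)
\]
% Flow-balanced surrogate (trajectory-balance style, assumed form): the squared
% log-ratio is zero iff pi_theta(y|x) = p(y|x), so theta and phi are trained jointly.
\[
  \mathcal{L}(\theta, \phi) \;=\; \mathbb{E}_{x,\; y \sim \pi_\theta}\!\Big[\big(\log Z_\phi(x) \,+\, \log \pi_\theta(y \mid x) \,-\, \beta\, r(x, y)\big)^{2}\Big]
\]

Matching the full reward-induced distribution in this way, rather than maximizing reward alone, is what the abstract credits with preserving less frequent but valid reasoning paths.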