FlowRL: Matching Reward Distributions for LLM Reasoning

September 18, 2025
Authors: Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin
cs.AI

Abstract

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
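
As a rough sketch of the objective described in the abstract (the paper's exact formulation may differ; the temperature β and the squared trajectory-balance-style surrogate below are illustrative assumptions): a learnable partition function Z_φ(x) turns the scalar reward r(x, y) into a normalized target distribution, the policy π_θ is trained to match that target under reverse KL, and flow balancing can be realized as a squared log-ratio loss that vanishes exactly when the two distributions coincide.

% Reward-induced target distribution, with learnable partition function Z_phi(x);
% beta is an assumed temperature not stated in the abstract.
\[
  p(y \mid x) \;=\; \frac{\exp\big(\beta\, r(x, y)\big)}{Z_\phi(x)}
\]
% Policy objective: minimize the reverse KL between the policy and the target.
\[
  \min_{\theta}\; D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\;\big\|\;p(\cdot \mid x)\big)
\]
% Flow-balanced surrogate (trajectory-balance style, assumed form): the squared
% log-ratio is zero iff pi_theta(y|x) = p(y|x), so theta and phi are trained jointly.
\[
  \mathcal{L}(\theta, \phi) \;=\; \mathbb{E}_{x,\; y \sim \pi_\theta}\!\Big[\big(\log Z_\phi(x) \,+\, \log \pi_\theta(y \mid x) \,-\, \beta\, r(x, y)\big)^{2}\Big]
\]

Matching the full reward-induced distribution in this way, rather than maximizing reward alone, is what the abstract credits with preserving less frequent but valid reasoning paths.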