FlowRL: Matching Reward Distributions for LLM Reasoning

Abstract

We propose FlowRL, a method for large language model (LLM) reinforcement learning (RL) that matches the full reward distribution via flow balancing instead of maximizing rewards. Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. In experiments on math and code reasoning tasks, FlowRL achieves an average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
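To make the idea of matching a reward distribution concrete, the following toy sketch works on three candidate responses rather than on an LLM. It rests on assumptions the abstract does not state: the target is taken to have the Boltzmann form exp(beta * r) / Z, the temperature `beta` is hypothetical, and the squared balance residual is borrowed from GFlowNet-style trajectory balance as one plausible reading of "flow balancing". All function names are illustrative, and the partition function is computed exactly here only because the response set is tiny; in an LLM setting it would have to be learned.

```python
import math

def target_distribution(rewards, beta):
    # Assumed Boltzmann form (not stated in the abstract): p(y) = exp(beta * r(y)) / Z.
    shift = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(beta * (r - shift)) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]

def log_partition(rewards, beta):
    # Exact log Z = log sum_y exp(beta * r(y)); tractable only in this toy setting.
    shift = max(rewards)
    return beta * shift + math.log(sum(math.exp(beta * (r - shift)) for r in rewards))

def reverse_kl(policy, target):
    # KL(policy || target) = sum_y policy(y) * log(policy(y) / target(y)).
    return sum(q * math.log(q / p) for q, p in zip(policy, target) if q > 0.0)

def balance_residual(log_z, log_prob, reward, beta):
    # Hypothetical flow-balance residual: zero when Z * policy(y) = exp(beta * r(y)).
    return (log_z + log_prob - beta * reward) ** 2

beta = 2.0                   # hypothetical temperature
rewards = [1.0, 0.9, 0.1]    # two good responses, one poor response
target = target_distribution(rewards, beta)

# A reward-maximizing policy collapses onto the single best response.
peaked = [0.98, 0.01, 0.01]
kl_peaked = reverse_kl(peaked, target)

# A distribution-matching policy keeps mass on the second good response too.
kl_matched = reverse_kl(target, target)

log_z = log_partition(rewards, beta)
residuals = [balance_residual(log_z, math.log(p), r, beta)
             for p, r in zip(target, rewards)]

print("target:", [round(p, 3) for p in target])
print("reverse KL, peaked policy:", kl_peaked)
print("reverse KL, matched policy:", kl_matched)
print("largest balance residual at the optimum:", max(residuals))
```

In this toy setting the collapsed policy has a strictly positive reverse KL to the target, while the matched policy drives both the reverse KL and every balance residual to (numerically) zero, which is the sense in which near-optimal but less frequent responses keep probability mass.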

