FlowRL: Matching Reward Distributions for LLM Reasoning

Abstract

We propose FlowRL, which matches the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. In experiments on math and code reasoning tasks, FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
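
To make the contrast between reward maximization and distribution matching concrete, here is a minimal toy sketch over three candidate reasoning paths. The abstract does not state the exact objective, so everything specific here is an assumption for illustration: the exponential form of the target, the temperature `beta`, the hypothetical reward values, and the squared log-residual "balance" loss in which `log_z` plays the role of the learnable partition function. It is not the authors' implementation.

```python
import math

def target_distribution(rewards, beta=1.0):
    """Normalize scalar rewards into a distribution proportional to exp(beta * r).

    Assumption: the exponential form and `beta` are illustrative choices.
    """
    scaled = [beta * r for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return [w / z for w in weights]

def reverse_kl(policy, target):
    """KL(policy || target) over a finite set of candidate responses."""
    return sum(p * math.log(p / q) for p, q in zip(policy, target) if p > 0)

def balance_loss(log_z, log_policy, rewards, beta=1.0):
    """Mean squared residual (log Z + log pi(y) - beta * r(y))^2.

    Assumption: a squared-residual surrogate with a scalar `log_z` standing in
    for the learnable partition function.
    """
    residuals = [log_z + lp - beta * r for lp, r in zip(log_policy, rewards)]
    return sum(e * e for e in residuals) / len(residuals)

# Hypothetical rewards for three reasoning paths: two good, one poor.
rewards = [1.0, 0.9, 0.1]
target = target_distribution(rewards)

greedy = [0.98, 0.01, 0.01]  # policy collapsed onto the single top-reward path
matched = list(target)       # policy that reproduces the target distribution

kl_greedy = reverse_kl(greedy, target)
kl_matched = reverse_kl(matched, target)

# With the exact log partition value, the matched policy has zero residual.
log_z = math.log(sum(math.exp(r) for r in rewards))
loss_matched = balance_loss(log_z, [math.log(p) for p in matched], rewards)

print(f"reverse KL, collapsed policy: {kl_greedy:.3f}")
print(f"reverse KL, matched policy:   {kl_matched:.3f}")
print(f"balance loss, matched policy: {loss_matched:.2e}")
```

In this toy setting the collapsed policy pays a reverse-KL penalty for ignoring the second, nearly as good path, while the matched policy keeps probability mass on both. That is the diversity effect the abstract describes, shown here only under the stated assumptions.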
