ExGRPO: Learning to Reason from Experience

Abstract

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, which leads to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping the learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable, and we identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical and general benchmarks, with average gains of +3.5 and +7.6 points, respectively, over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
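
The abstract names two indicators of experience value (rollout correctness and entropy) and a mixed batch of fresh and replayed rollouts, but it does not give the selection rule. The following is a minimal illustrative sketch of how such a replay buffer could be organized, not the paper's implementation. The `Experience` record, the `priority` weighting that favors medium-accuracy questions, the lowest-entropy pick, and the `replay_ratio` parameter are all assumptions introduced here for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    question_id: str
    response: str
    correct: bool    # verifiable reward: did the rollout reach the right answer?
    entropy: float   # mean token entropy of the rollout (assumed precomputed)


def accuracy(rollouts):
    """Fraction of stored rollouts for one question that were correct."""
    return sum(r.correct for r in rollouts) / len(rollouts)


def priority(acc, target=0.5, width=0.25):
    """Hypothetical weighting: highest for questions of medium difficulty."""
    return math.exp(-((acc - target) ** 2) / (2 * width ** 2))


def select_replay(buffer, k):
    """Pick up to k experiences from a dict of question_id -> rollouts.

    Assumed rule: per question, keep the correct rollout with the lowest
    entropy, then rank questions by the priority of their accuracy.
    """
    scored = []
    for rollouts in buffer.values():
        correct = [r for r in rollouts if r.correct]
        if not correct:
            continue  # nothing worth replaying for this question
        best = min(correct, key=lambda r: r.entropy)
        scored.append((priority(accuracy(rollouts)), best))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [exp for _, exp in scored[:k]]


def mixed_batch(fresh, buffer, batch_size, replay_ratio=0.5):
    """Fill part of a batch with replayed experiences, the rest with fresh items."""
    n_replay = int(batch_size * replay_ratio)
    replayed = select_replay(buffer, n_replay)
    return fresh[: batch_size - len(replayed)] + replayed
```

In this sketch, a question with no correct rollout contributes nothing to replay, and a question the model already solves every time is ranked below one it solves about half the time. Both behaviors follow from the assumed `priority` function, not from anything stated in the abstract.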
