GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training

Abstract

Recent work such as DeepSeek-R1 has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that generates multiple answers from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of the thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further show that increasing the number of answers per thought consistently improves model performance.
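
The variance claim can be illustrated with a small simulation. The sketch below is not the GRPO-MA advantage computation, which the abstract does not specify. It assumes binary rewards, independent answers, and a fixed per-thought success probability, and the names `thought_value_estimate`, `estimate_variance`, `p_correct`, and `num_answers` are hypothetical. Under these assumptions, the value of a thought is estimated as the mean reward over its sampled answers, and the variance of that estimate shrinks roughly as 1 / (answers per thought).

```python
import random
import statistics


def thought_value_estimate(p_correct, num_answers, rng):
    """Mean binary reward over `num_answers` answers sampled from one thought.

    Assumption: each answer is independently correct with probability
    `p_correct` and receives reward 1.0 if correct, else 0.0.
    """
    rewards = [1.0 if rng.random() < p_correct else 0.0 for _ in range(num_answers)]
    return sum(rewards) / num_answers


def estimate_variance(p_correct, num_answers, trials=20000, seed=0):
    """Empirical variance of the thought-value estimate across repeated trials."""
    rng = random.Random(seed)
    estimates = [
        thought_value_estimate(p_correct, num_answers, rng) for _ in range(trials)
    ]
    return statistics.pvariance(estimates)


if __name__ == "__main__":
    # More answers per thought should give a lower-variance value estimate.
    for num_answers in (1, 2, 4, 8):
        print(num_answers, round(estimate_variance(0.5, num_answers), 4))
```

For a Bernoulli reward the variance of the mean is p(1 - p) / M for M answers, so the printed values should roughly halve each time the number of answers doubles. This reflects only the averaging effect under the stated assumptions, not the paper's full analysis.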
