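GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training

Abstract

Recent work such as DeepSeek-R1 has shown that GRPO, a reinforcement learning (RL) algorithm, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that generates multiple answers from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of the thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further show that increasing the number of answers per thought consistently improves model performance.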
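
The variance claim can be illustrated with a toy simulation. The sketch below is not the paper's implementation. It assumes binary (0/1) rewards, a fixed per-thought success probability `p_correct`, and a thought value defined as the mean reward over the answers sampled from that thought. The group normalization follows the usual GRPO form (value minus group mean, divided by group standard deviation). All function and parameter names are hypothetical.

```python
import random
import statistics


def thought_value(p_correct, num_answers, rng):
    """Mean 0/1 reward over `num_answers` answers sampled from one thought.

    Assumption: each answer is correct independently with probability
    `p_correct`; this is an illustrative reward model only.
    """
    rewards = [1.0 if rng.random() < p_correct else 0.0
               for _ in range(num_answers)]
    return sum(rewards) / num_answers


def group_advantages(values):
    """GRPO-style normalization: (value - group mean) / group std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        # All thoughts scored the same, so there is no learning signal.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def value_estimate_variance(p_correct, num_answers, trials=20000, seed=0):
    """Empirical variance of the thought-value estimate across many trials."""
    rng = random.Random(seed)
    samples = [thought_value(p_correct, num_answers, rng)
               for _ in range(trials)]
    return statistics.pvariance(samples)


if __name__ == "__main__":
    # Under these assumptions the variance is p(1 - p) / M, so it should
    # shrink roughly in proportion to the number of answers per thought.
    for m in (1, 2, 4, 8):
        print(m, value_estimate_variance(0.5, m))
```

Under this toy reward model the value estimate for a thought has variance p(1 - p) / M for M answers, so averaging over more answers gives a less noisy input to the advantage normalization. How the actual method combines thought and answer advantages is not specified in the abstract and is left out here.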

