ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Abstract

We introduce ThinkTwice, a simple two-phase framework, based on Group Relative Policy Optimization (GRPO), that jointly optimizes LLMs to solve reasoning problems and to refine their answers. In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems. Both phases use the same binary correctness reward, without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 percentage points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: early in training, refinement predominantly corrects errors; as the model improves, it naturally shifts toward preserving already-correct solutions, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
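To make the two-phase setup concrete, the following is a minimal sketch and not the authors' implementation. It shows the standard GRPO idea of normalizing rewards within a group of sampled responses, applied to a binary correctness reward, plus an alternating solve/refine schedule. The function names, the exact-match reward check, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def binary_reward(answer: str, reference: str) -> float:
    """Hypothetical correctness check: 1.0 on exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group.

    Assumption: population std with a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def phase_schedule(num_steps: int) -> list[str]:
    """Alternate phases so each pair of steps is (solve, refine)."""
    return ["solve" if step % 2 == 0 else "refine" for step in range(num_steps)]


if __name__ == "__main__":
    # Four sampled answers to one problem whose reference answer is "42".
    samples = ["42", "41", " 42 ", "7"]
    rewards = [binary_reward(s, "42") for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
    print(phase_schedule(4))
```

In this sketch the same `binary_reward` would score responses in both phases; only the prompt differs, with the refine phase conditioning on the model's earlier solution. When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal.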
