ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement
April 2, 2026
Authors: Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson
cs.AI
Abstract
We introduce ThinkTwice, a simple two-phase framework based on Group Relative Policy Optimization (GRPO) that jointly optimizes LLMs to solve reasoning problems and to refine their own answers. In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then on refining its own solutions to the same problems, using the same binary correctness reward in both phases and relying on no additional correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families (Qwen3-4B and Olmo3-7B), ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, measured by pass@4, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step. Analysis of ThinkTwice's training dynamics reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
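The abstract describes an alternation between a solve step and a refine step, both scored by the same binary correctness reward. The sketch below illustrates that control flow only; it is not the authors' implementation, and every helper it takes as a parameter (sample_batch, generate, is_correct, build_refine_prompt, grpo_step) is a hypothetical placeholder standing in for whatever sampling, verification, and GRPO update machinery is actually used.

```python
def think_twice_loop(model, problems, *, sample_batch, generate, is_correct,
                     build_refine_prompt, grpo_step, num_pairs, group_size=8):
    """Alternate GRPO updates: one step on solving, one on self-refining,
    both rewarded only by binary final-answer correctness."""
    for _ in range(num_pairs):
        batch = sample_batch(problems)

        # Phase 1: solve. Sample a group of candidate solutions per problem
        # and score each 1/0 by whether its final answer is correct.
        solutions = [generate(model, p, group_size) for p in batch]
        solve_rewards = [[float(is_correct(p, s)) for s in group]
                         for p, group in zip(batch, solutions)]
        grpo_step(model, batch, solutions, solve_rewards)

        # Phase 2: refine. Prompt the model with the same problem plus its
        # own phase-1 solutions (no correctness label, no critique), and
        # score the refined answers with the same binary reward.
        refine_prompts = [build_refine_prompt(p, group)
                          for p, group in zip(batch, solutions)]
        refinements = [generate(model, rp, group_size) for rp in refine_prompts]
        refine_rewards = [[float(is_correct(p, r)) for r in group]
                          for p, group in zip(batch, refinements)]
        grpo_step(model, refine_prompts, refinements, refine_rewards)
    return model
```

Under this reading, the rectify-then-fortify curriculum emerges without any scheduling: early in training most phase-1 solutions are wrong, so phase 2 is rewarded mainly for correcting them; as phase-1 accuracy rises, phase 2 is increasingly rewarded for preserving already-correct answers.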