

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

April 2, 2026
作者: Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson
cs.AI

Abstract

We introduce ThinkTwice, a simple two-phase framework based on Group Relative Policy Optimization (GRPO) that jointly optimizes LLMs to solve reasoning problems and to refine their answers. In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases and requiring no extra correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families (Qwen3-4B and Olmo3-7B), ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a cleaner reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR (reinforcement learning with verifiable rewards).
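To make the alternating structure concrete, here is a minimal sketch of one ThinkTwice step pair as described in the abstract: a solve phase followed by a refine phase on the same problems, both scored by the same binary correctness reward with GRPO-style group-relative advantages. All function names, the group size, and the choice of which draft to refine are illustrative assumptions, not the authors' implementation.

```python
import statistics

def group_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a sampled group.
    With binary rewards, correct answers in a mixed group get positive
    advantage, incorrect ones negative; uniform groups give zero signal."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def thinktwice_pair(problems, sample, refine, check, update, group_size=4):
    """One ThinkTwice training-step pair (hypothetical interface):
    `sample(p)` draws a solution, `refine(p, draft)` revises one,
    `check(p, a)` is the binary correctness reward, and `update(...)`
    stands in for the GRPO policy-gradient step."""
    drafts = {}
    # Phase 1: optimize solving.
    for p in problems:
        group = [sample(p) for _ in range(group_size)]
        advs = group_advantages([check(p, a) for a in group])
        update(p, group, advs)
        drafts[p] = group[0]  # keep one draft for the refinement phase
    # Phase 2: optimize refining the model's own drafts, same reward.
    for p in problems:
        group = [refine(p, drafts[p]) for _ in range(group_size)]
        advs = group_advantages([check(p, a) for a in group])
        update((p, drafts[p]), group, advs)
    return drafts
```

The zero-variance guard reflects a known GRPO property: groups that are all correct or all incorrect carry no learning signal, which is why the abstract's rectify-then-fortify dynamic (early error-correction, later preservation of correct drafts) matters for keeping the refinement phase informative.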