Learn Hard Problems During RL with Reference Guided Fine-tuning
March 1, 2026
Authors: Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai
cs.AI
Abstract
Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: on challenging problems, LLMs fail to sample any correct trajectories, so RL receives no meaningful positive feedback. Such problems often come with human-written reference solutions (e.g., problems from AoPS), but directly fine-tuning on these solutions offers little benefit, because models often cannot imitate human proofs that lie outside their own reasoning distribution.
We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that utilizes human-written reference solutions to synthesize positive trajectories on hard problems and train on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance.
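The guided-synthesis loop described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the prompt format, the hint fractions, and the `generate` and `is_correct` callables are all assumptions made for the example.

```python
# Sketch of reference-guided trajectory synthesis (ReGFT-style, hypothetical).
# Idea: reveal a prefix of the human reference solution as a hint, then let
# the model produce its own continuation, keeping only traces that reach the
# correct final answer.

def build_guided_prompt(problem: str, reference: str, fraction: float) -> str:
    """Prepend the first `fraction` of the reference solution as a hint."""
    cut = int(len(reference) * fraction)
    partial = reference[:cut]
    return (
        f"Problem: {problem}\n"
        f"Partial solution: {partial}\n"
        "Continue the reasoning and finish the solution:\n"
    )

def synthesize_trajectories(problem, reference, answer, generate, is_correct,
                            fractions=(0.25, 0.5, 0.75), samples=4):
    """Collect model-generated traces that reach the correct answer,
    escalating the amount of reference guidance until some succeed."""
    positives = []
    for frac in fractions:             # weakest guidance first
        prompt = build_guided_prompt(problem, reference, frac)
        for _ in range(samples):
            trace = generate(prompt)   # model's own reasoning continuation
            if is_correct(trace, answer):
                positives.append(trace)
        if positives:                  # stop once the problem becomes solvable
            break
    return positives
```

Because the continuation is sampled from the model itself, the retained traces stay within its reasoning distribution while still inheriting the structure of the human reference; they would then be used as SFT data before RL.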
Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.