

Learn Hard Problems During RL with Reference Guided Fine-tuning

March 1, 2026
Authors: Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai
cs.AI

Abstract

Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: on challenging problems, LLMs fail to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, human-written reference solutions often accompany such problems (e.g., problems from AoPS), but directly fine-tuning on these solutions offers little benefit, because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that uses human-written reference solutions to synthesize positive trajectories on hard problems and trains on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring that the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, and BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
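The trajectory-synthesis step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and `verify` are hypothetical stand-ins for an LLM sampler and an answer checker, and the prefix fractions and prompt wording are assumptions.

```python
def synthesize_trajectories(problem, reference_solution, generate, verify,
                            prefix_fractions=(0.25, 0.5, 0.75),
                            samples_per_prefix=4):
    """Reference-guided trajectory synthesis (ReGFT-style sketch).

    For each prefix fraction, reveal that much of the human-written
    reference solution as a hint, let the model complete the reasoning
    in its own words, and keep only completions that verify as correct.
    """
    positives = []
    for frac in prefix_fractions:
        cut = int(len(reference_solution) * frac)
        hint = reference_solution[:cut]  # partial reference shown to the model
        prompt = (f"{problem}\n\nPartial solution:\n{hint}\n\n"
                  "Continue the reasoning:")
        for _ in range(samples_per_prefix):
            trace = generate(prompt)      # model samples from its own distribution
            if verify(problem, trace):    # keep only verified-correct traces
                # Training pairs map the bare problem to the model's own
                # trace (hint dropped), so fine-tuning stays within the
                # model's reasoning space.
                positives.append({"prompt": problem, "completion": trace})
    return positives

# Toy stand-ins: a "model" that only succeeds once enough of the
# reference ("2+2 =") has been revealed in the prompt.
traj = synthesize_trajectories(
    "What is 2+2?", "2+2 = 4",
    generate=lambda p: "The answer is 4." if "2+2 =" in p else "Unsure.",
    verify=lambda q, t: "4" in t,
)
```

In this toy run, only the largest prefix reveals enough of the reference for the stub "model" to answer correctly, mirroring the paper's premise that partial guidance unlocks problems the model cannot solve unaided.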