蒸馏大语言模型反馈用于Lean定理证明

摘要

推理模型的后训练通常结合监督微调与基于可验证奖励的强化学习，其中GRPO是最常见的方法。然而，该算法存在奖励稀疏、探索受限及模式坍缩等问题。基于近期自蒸馏研究，我们提出反馈蒸馏训练方法：该方法在词元级别上，使模型自身的分布与语言模型提供的特权反馈条件分布相匹配。反馈蒸馏提供词元级监督，并能够注入外部知识。通过在Lean4定理证明任务上评估该方法，我们发现反馈蒸馏比GRPO能维持更高的生成轨迹多样性，从而产生更高的策略熵和更优的pass@k缩放效果。两种方法具有互补性：从反馈蒸馏检查点初始化GRPO，其性能优于单独使用任一方法。总体而言，我们的研究结果为改善复杂推理任务的后训练提供了一条有前景的途径。

English

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.