蒸餾大型語言模型反饋以進行Lean定理證明

摘要

推理模型的後訓練通常結合監督式微調與基於可驗證獎勵的強化學習，最常見的是使用GRPO。然而，此演算法面臨稀疏獎勵、探索範圍有限以及模式崩潰等問題。基於近期關於自蒸餾的研究，我們提出反饋蒸餾（Feedback Distillation）訓練方法，該方法讓模型在詞元層級上，配對其自身於語言模型所產生的特權反饋條件下的分佈。反饋蒸餾能提供詞元層級的監督，並可注入外部知識。在Lean4定理證明任務上評估我們的方法後發現，反饋蒸餾在生成軌跡的多樣性上優於GRPO，從而實現更高的策略熵與更佳的pass@k縮放表現。這兩種方法具有互補性：從反饋蒸餾的檢查點初始化GRPO，其效果優於單獨使用任一種方法。總體而言，我們的結果為改善複雜推理的後訓練提供了有前景的方向。

English

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.