リーン定理証明のためのLLMフィードバックの蒸留

要旨

推論モデルのポストトレーニングは通常、検証可能な報酬を用いた強化学習（特にGRPO）と教師ありファインチューニングを組み合わせる。しかし、このアルゴリズムはスパースな報酬、限定的な探索、およびモード崩壊の問題を抱えている。近年の自己蒸留に関する研究を基に、我々はフィードバック蒸留（Feedback Distillation）を提案する。これは、モデルがトークンレベルで、言語モデルによって生成された特権的フィードバックを条件とした自身の分布に一致するよう訓練される手法である。フィードバック蒸留はトークンレベルの教師信号を提供し、外部知識を注入することができる。本手法をLean4による定理証明で評価した結果、フィードバック蒸留はGRPOよりも生成された軌跡の多様性を維持し、より高い方策エントロピーと優れたpass@kスケーリングを示した。両手法は相補的であり、フィードバック蒸留のチェックポイントからGRPOを初期化することで、単独使用のいずれの手法よりも優れた性能が得られる。総じて、これらの結果は複雑な推論におけるポストトレーニングの改善に向けた有望な方向性を示唆している。

English

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.