

Reinforcement-aware Knowledge Distillation for LLM Reasoning

February 26, 2026
Authors: Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto
cs.AI

Abstract

Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
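The abstract describes TRRD as a PPO/GRPO-style likelihood-ratio objective whose anchor is a mixture of the teacher and the old (pre-update) student policy, rather than the old policy alone. A minimal sketch of what such a clipped-ratio loss could look like is below; the function name, the probability-space mixing weight `mix`, and the exact form of the surrogate are assumptions for illustration, not the paper's equation:

```python
import numpy as np

def trrd_loss(logp_student, logp_teacher, logp_old, advantages,
              mix=0.5, clip_eps=0.2):
    """Hypothetical TRRD-style surrogate: a PPO-like clipped ratio whose
    denominator is a teacher--old-policy mixture instead of the old
    policy alone. All inputs are per-token log-probabilities on the
    student's own rollouts, plus per-token advantages."""
    # Mixture anchor in probability space: q = mix * p_teacher + (1 - mix) * p_old.
    # With mix > 0, raising the ratio also pulls the student toward the teacher,
    # but only on tokens whose advantage makes the update worthwhile.
    anchor = mix * np.exp(logp_teacher) + (1.0 - mix) * np.exp(logp_old)
    ratio = np.exp(logp_student) / anchor
    # Clipped surrogate, as in PPO/GRPO: bounds the update to a trust region
    # around the anchor, so imitation cannot swamp reward maximization.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negated because we minimize the loss to maximize the surrogate.
    return -np.mean(np.minimum(unclipped, clipped))
```

When student, teacher, and old policy agree, the ratio is 1 and the loss reduces to the (negated) mean advantage, matching the standard PPO surrogate at initialization; the mixture only changes behavior where teacher and student diverge.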
PDF (12) · March 7, 2026