Reinforcement-aware Knowledge Distillation for LLM Reasoning

February 26, 2026
Authors: Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto
cs.AI

Abstract

Reinforcement learning (RL) post-training has recently driven major gains in large language models (LLMs) that perform long chain-of-thought reasoning, but the high inference cost of such models motivates distilling them into smaller student models. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or on regularization based on the teacher-student Kullback-Leibler (KL) divergence. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization, requiring careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL, guiding the student toward the teacher only when doing so improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a mixture of the teacher and the old policy, yielding advantage-aware, trust-region-bounded distillation on student rollouts that naturally balances exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
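
The abstract describes TRRD only at a high level. As an illustration under assumptions (the paper's exact mixture weighting, clipping scheme, and advantage estimation are not given here), a minimal PyTorch sketch of a clipped likelihood-ratio loss anchored to a teacher/old-policy mixture might look like the following; the function and parameter names (`trrd_loss`, `alpha`, `clip_eps`) are hypothetical, not the paper's.

```python
import math

import torch


def trrd_loss(logp_student: torch.Tensor,
              logp_old: torch.Tensor,
              logp_teacher: torch.Tensor,
              advantages: torch.Tensor,
              alpha: float = 0.5,
              clip_eps: float = 0.2) -> torch.Tensor:
    """Hypothetical sketch of a TRRD-style objective.

    Inputs are per-token log-probabilities of the sampled tokens under the
    current student policy, the old (rollout) policy, and the teacher, each
    of shape (batch, seq_len). `advantages` holds GRPO-style
    group-normalized advantages broadcast over tokens. `alpha` in (0, 1) is
    an assumed mixing weight for the teacher in the anchor distribution.
    """
    # Anchor: log of the mixture (1 - alpha) * pi_old + alpha * pi_teacher,
    # evaluated at the sampled tokens and computed in log space for
    # numerical stability.
    logp_mix = torch.logsumexp(
        torch.stack([logp_old + math.log(1.0 - alpha),
                     logp_teacher + math.log(alpha)]),
        dim=0,
    )

    # PPO/GRPO-style probability ratio, taken against the mixture anchor
    # instead of the old policy alone.
    ratio = torch.exp(logp_student - logp_mix)

    # Clipped surrogate: imitation pressure toward the teacher is applied
    # only where the advantage indicates it improves the policy update, and
    # the clip bounds the update inside a trust region.
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    return -surrogate.mean()
```

In this sketch, taking `alpha` toward 0 recovers the standard clipped GRPO ratio against the old policy alone, so the teacher anchor can be read as softly biasing the trust region toward the teacher rather than adding a separate, competing KL term.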