

OVD: On-policy Verbal Distillation

January 29, 2026
作者: Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Ngai Wong
cs.AI

Abstract

Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0--9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to +12.9% absolute improvement in average EM on Web Q&A tasks and up to a +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io
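The core idea described in the abstract, replacing token-level probability matching with trajectory-level rewards derived from the teacher's discrete verbal scores (0--9), can be sketched as a simple REINFORCE-style loop. This is a minimal toy illustration, not the paper's implementation: the two-action student policy, the `teacher_verbal_score` stub, and all function names are hypothetical stand-ins.

```python
import math
import random

def teacher_verbal_score(trajectory):
    """Toy stand-in for the teacher model: it returns a discrete verbal
    score in 0..9 for the whole trajectory (here, judged only by whether
    the final answer is correct)."""
    return 9 if trajectory.endswith("4") else 2

def ovd_step(logit, lr=1.0, n_samples=4, rng=None):
    """One on-policy step: sample student trajectories, ask the teacher
    for a verbal score on each, and reinforce the log-probability of each
    trajectory in proportion to its normalized score.

    `logit` parameterizes a two-action toy student policy that answers
    either "2+2=4" or "2+2=5"; no token-level alignment with the teacher
    is needed, only a scalar score per trajectory."""
    rng = rng or random.Random(0)
    p_good = 1.0 / (1.0 + math.exp(-logit))
    grad = 0.0
    for _ in range(n_samples):
        if rng.random() < p_good:
            traj, dlogp = "2+2=4", 1.0 - p_good   # d log p(traj) / d logit
        else:
            traj, dlogp = "2+2=5", -p_good
        reward = teacher_verbal_score(traj) / 9.0  # map 0..9 -> [0, 1]
        grad += reward * dlogp
    return logit + lr * grad / n_samples           # REINFORCE-style update

rng = random.Random(42)
logit = 0.0
for _ in range(50):
    logit = ovd_step(logit, rng=rng)
```

Because only a single scalar score per trajectory flows back from the teacher, no teacher logits or token-level distributions need to be kept in memory, which is the source of the memory savings the abstract claims.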