OVD: On-policy Verbal Distillation
January 29, 2026
Authors: Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Ngai Wong
cs.AI
Abstract
Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0--9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to a +12.9% absolute improvement in average EM on Web Q&A tasks and up to a +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io
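The abstract does not give implementation details, but the core idea — scoring whole student trajectories with a discrete 0--9 verbal grade from the teacher and using that grade as a trajectory-level reward instead of token-level probability matching — can be sketched minimally. The helper names (`parse_verbal_score`, `reinforce_weights`) and the reply format are illustrative assumptions, not the authors' actual interface:

```python
import re

def parse_verbal_score(teacher_reply: str) -> float:
    """Extract a discrete 0-9 verbal score from a teacher's reply and
    normalize it to [0, 1] for use as a trajectory-level reward.
    (Hypothetical sketch: the real prompt/reply format is unspecified.)"""
    match = re.search(r"\b([0-9])\b", teacher_reply)
    if match is None:
        return 0.0  # unparseable feedback -> zero reward
    return int(match.group(1)) / 9.0

def reinforce_weights(rewards):
    """Turn trajectory rewards into REINFORCE-style advantages by
    subtracting a batch-mean baseline (standard variance reduction);
    these weights would scale the log-likelihood of each sampled
    student trajectory, with no token-level teacher alignment needed."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Example: three sampled student trajectories graded by the teacher.
replies = ["Score: 8", "Score: 3", "Score: 9"]
rewards = [parse_verbal_score(r) for r in replies]
advantages = reinforce_weights(rewards)
```

Because only a single digit per trajectory crosses from teacher to student, no teacher logits or vocabularies need to be kept in memory, which is consistent with the memory savings the abstract claims.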