Black-Box On-Policy Distillation of Large Language Models
November 13, 2025
Authors: Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei
cs.AI
Abstract
Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.
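The abstract frames GAD as a minimax game between the student generator and a discriminator. In standard GAN-style notation (our own shorthand, not taken from the paper), with prompts $x$ drawn from a dataset $\mathcal{D}$, student policy $\pi_{\theta}$, teacher policy $\pi_{T}$, and a discriminator $D_{\phi}$ scoring prompt-response pairs, the objective can be sketched as:

```latex
\min_{\theta} \max_{\phi} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{T}(\cdot \mid x)}
  \left[ \log D_{\phi}(x, y) \right]
+ \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
  \left[ \log \bigl( 1 - D_{\phi}(x, y) \bigr) \right]
```

Under this reading, the discriminator's score on the student's own sampled responses acts as the on-policy reward signal the student maximizes, which is why the abstract describes $D_{\phi}$ as a reward model that co-evolves with the student.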