Privileged Information Distillation for Language Models
February 4, 2026
Authors: Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia
cs.AI
Abstract
Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning (RL) in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, where closed-source systems typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable but the reasoning process is not. To address this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model. We also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains with RL under a reverse KL penalty between the student and the PI-conditioned teacher. We show that both algorithms effectively distill frontier agents using action-only PI. Specifically, we find that π-Distill, and in some cases OPSD, outperforms industry-standard practice (supervised fine-tuning followed by RL), which assumes access to full chain-of-thought supervision, across multiple agentic benchmarks, models, and forms of PI. We complement these results with extensive analysis of the factors that enable effective learning with PI, focusing primarily on π-Distill and identifying when OPSD is competitive.
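The abstract describes the two objectives only at a high level. As a purely illustrative sketch (the notation and exact form are assumptions, not taken from the paper), with $\pi_\theta$ the shared model, $x$ the interaction context, $z$ the privileged information, $y$ the generated trajectory, and $\lambda$, $\beta$ weighting coefficients, the two methods could be read as:

\[
\mathcal{L}_{\pi\text{-Distill}}(\theta) = -\,\mathbb{E}_{(x,z,y)}\big[\log \pi_\theta(y \mid x, z) + \lambda \log \pi_\theta(y \mid x)\big],
\]
\[
\mathcal{J}_{\mathrm{OPSD}}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid x, z)\big),
\]

i.e., π-Distill jointly fits the same model as a PI-conditioned teacher and an unconditioned student, while OPSD runs on-policy RL on the student with a reverse-KL penalty toward the PI-conditioned teacher. This is only one plausible reading of the abstract's description, not the paper's actual formulation.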