Privileged Information Distillation for Language Models
February 4, 2026
Authors: Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia
cs.AI
Abstract
Training-time privileged information (PI) can enable language models to succeed on tasks at which they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, where closed-source systems typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable but the reasoning process is not. To address this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously within the same model. We also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains with reinforcement learning (RL) using a reverse KL penalty between the student and the PI-conditioned teacher. We show that both algorithms effectively distill frontier agents using action-only PI. Specifically, we find that π-Distill, and in some cases OPSD, outperforms the industry-standard practice of supervised fine-tuning followed by RL, which assumes access to full Chain-of-Thought supervision, across multiple agentic benchmarks, models, and forms of PI. We complement these results with extensive analysis of the factors that enable effective learning with PI, focusing primarily on π-Distill and identifying when OPSD is competitive.
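To make the two objectives concrete, the following is a minimal mathematical sketch consistent with the description above; it is an illustrative assumption rather than the paper's exact formulation. The notation ($x$ for the task context, $z$ for the privileged information, $a$ for an action trajectory, $\mathcal{D}$ for the distillation dataset) and the weighting coefficients $\lambda$ and $\beta$ are hypothetical.

For π-Distill, a single set of parameters $\theta$ plays both roles, fitting the frontier agent's action-only trajectories with and without the PI in context:

$$\mathcal{L}_{\pi\text{-Distill}}(\theta) \;=\; \mathbb{E}_{(x,\,z,\,a) \sim \mathcal{D}} \Big[ -\log \pi_\theta(a \mid x, z) \;-\; \lambda \log \pi_\theta(a \mid x) \Big],$$

where the PI-conditioned term acts as the teacher and the unconditioned term as the student.

For OPSD, one plausible reading is an RL objective in which a reverse KL penalty keeps the unconditioned student close to the PI-conditioned teacher:

$$\mathcal{J}_{\text{OPSD}}(\theta) \;=\; \mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)} \big[ R(a) \big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{teacher}}(\cdot \mid x, z) \big),$$

with $R$ the task reward. The reverse direction of the KL (student relative to teacher) matches the abstract's description of a penalty between the student and the PI-conditioned teacher.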