On-Policy Self-Distillation for Reasoning Compression
March 5, 2026
Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun
cs.AI
Abstract
Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant: it is actively harmful, compounding errors with every unnecessary token.
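The core objective described above can be sketched numerically. A minimal, illustrative implementation of per-token reverse KL, where `student_logits` come from the model on its own rollout and `teacher_logits` come from the same model conditioned on a "be concise" instruction; the function names and numpy formulation are assumptions for illustration, not the authors' code:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_reverse_kl(student_logits, teacher_logits):
    """Reverse KL, KL(p_student || p_teacher), one scalar per token position.

    Both inputs have shape (..., vocab_size). Reverse KL is mode-seeking:
    the student is penalized wherever it puts mass the teacher does not.
    """
    log_ps = log_softmax(student_logits)
    log_pt = log_softmax(teacher_logits)
    ps = np.exp(log_ps)
    return (ps * (log_ps - log_pt)).sum(axis=-1)

# Toy example: two token positions over a 3-word vocabulary.
student = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
teacher = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 5.0]])
kl = per_token_reverse_kl(student, teacher)
```

Where student and teacher distributions agree (first position), the loss is zero; where the student spreads mass the teacher concentrates away from (second position), the loss is positive, so training pulls the student toward the concise teacher token by token.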