On-Policy Self-Distillation for Reasoning Compression
March 5, 2026
Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun
cs.AI
Abstract
Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token.
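To make the core idea concrete, here is a minimal sketch of the per-token reverse KL objective in PyTorch. This is an illustrative reconstruction from the abstract, not the authors' released code: the function name and tensor layout are assumptions, and it presumes student and teacher logits have already been computed and aligned over the same response positions, with the teacher pass using the same model weights under a "be concise" instruction prefix.

```python
import torch
import torch.nn.functional as F

def per_token_reverse_kl(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         response_mask: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL, KL(student || teacher), averaged over response tokens.

    student_logits: [batch, seq, vocab] from the model on its own rollout.
    teacher_logits: [batch, seq, vocab] from the SAME model whose prompt was
                    prefixed with a "be concise" instruction, aligned to the
                    same response positions as the student (no gradient).
    response_mask:  [batch, seq], 1 on response tokens, 0 on prompt tokens.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits.detach(), dim=-1)

    # Reverse KL takes the expectation under the student's own distribution,
    # so the objective is mode-seeking: the student sharpens toward outputs
    # the concise-conditioned teacher also assigns high probability.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)

    # Average only over response tokens; prompt positions carry no loss.
    return (kl * response_mask).sum() / response_mask.sum().clamp(min=1)
```

Because the loss is computed on the student's own rollouts (on-policy), gradient only flows through tokens the student actually generates, which is plausibly what lets OPSDC compress easy problems aggressively while leaving longer deliberation on hard ones intact.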