On-Policy Self-Distillation for Reasoning Compression
March 5, 2026
Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun
cs.AI
Abstract
Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant: it is actively harmful, compounding errors with every unnecessary token.
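The core objective described above can be sketched numerically. A minimal, illustrative implementation of per-token reverse KL, where `student_logits` come from the model on its own rollout and `teacher_logits` come from the same model conditioned on a "be concise" instruction; the function names and numpy formulation are assumptions for illustration, not the authors' code:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_reverse_kl(student_logits, teacher_logits):
    """Reverse KL, KL(p_student || p_teacher), one scalar per token position.

    Both inputs have shape (..., vocab_size). Reverse KL is mode-seeking:
    the student is penalized wherever it puts mass the teacher does not.
    """
    log_ps = log_softmax(student_logits)
    log_pt = log_softmax(teacher_logits)
    ps = np.exp(log_ps)
    return (ps * (log_ps - log_pt)).sum(axis=-1)

# Toy example: two token positions over a 3-word vocabulary.
student = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
teacher = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 5.0]])
kl = per_token_reverse_kl(student, teacher)
```

Where student and teacher distributions agree (first position), the loss is zero; where the student spreads mass the teacher concentrates away from (second position), the loss is positive, so training pulls the student toward the concise teacher token by token.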