On-Policy Self-Distillation for Reasoning Compression
March 5, 2026
Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun
cs.AI
Abstract
Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token.
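To make the core idea concrete, here is a minimal sketch of the per-token reverse KL objective in PyTorch. This is an illustrative reconstruction from the abstract, not the authors' released code: the function name and tensor layout are assumptions, and it presumes student and teacher logits have already been computed and aligned over the same response positions, with the teacher pass using the same model weights under a "be concise" instruction prefix.

```python
import torch
import torch.nn.functional as F

def per_token_reverse_kl(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         response_mask: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL, KL(student || teacher), averaged over response tokens.

    student_logits: [batch, seq, vocab] from the model on its own rollout.
    teacher_logits: [batch, seq, vocab] from the SAME model whose prompt was
                    prefixed with a "be concise" instruction, aligned to the
                    same response positions as the student (no gradient).
    response_mask:  [batch, seq], 1 on response tokens, 0 on prompt tokens.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits.detach(), dim=-1)

    # Reverse KL takes the expectation under the student's own distribution,
    # so the objective is mode-seeking: the student sharpens toward outputs
    # the concise-conditioned teacher also assigns high probability.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)

    # Average only over response tokens; prompt positions carry no loss.
    return (kl * response_mask).sum() / response_mask.sum().clamp(min=1)
```

Because the loss is computed on the student's own rollouts (on-policy), gradient only flows through tokens the student actually generates, which is plausibly what lets OPSDC compress easy problems aggressively while leaving longer deliberation on hard ones intact.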