On-Policy Self-Distillation for Reasoning Compression

Abstract

Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token.
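To make the objective concrete, the following is a minimal sketch of the per-token reverse KL term in plain Python. It is not the authors' implementation. It assumes that "reverse KL" means KL(student || teacher) over the vocabulary at each position of a student-sampled rollout, and that the teacher logits come from the same model scored on the same rollout with a "be concise" instruction added to its prompt. The function names and the toy logits are hypothetical. A real training loop would compute this in a deep learning framework, with gradients flowing only through the student logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_reverse_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) for each position of a rollout.

    Both arguments are lists of per-position logit lists with the same
    shape. Assumption: the rollout was sampled from the student, and the
    teacher is the same model conditioned on a conciseness instruction.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)  # student log-probabilities
        log_q = log_softmax(t_row)  # teacher log-probabilities
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls


def sequence_loss(student_logits, teacher_logits):
    """Average the per-token reverse KL over the rollout length."""
    kls = per_token_reverse_kl(student_logits, teacher_logits)
    return sum(kls) / len(kls)


# Hypothetical toy logits: 2 rollout positions, vocabulary of 3 tokens.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[2.0, 0.5, -1.0], [1.5, 0.2, -0.7]]

kls = per_token_reverse_kl(student, teacher)
loss = sequence_loss(student, teacher)
```

In this toy case the first position contributes nothing because student and teacher agree there, while the second position is penalized because the student spreads probability mass where the teacher is more decisive. Note that nothing in the sketch uses a ground-truth answer or a token budget, which matches the description above.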