On-Policy Self-Distillation for Reasoning Compression

Abstract

Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize the per-token reverse KL divergence on the student's own rollouts. It requires no ground-truth answers, no token budgets, and no difficulty estimators, only self-distillation. Despite this simplicity, the behavior is adaptive: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve a 57-59% token reduction on MATH-500 while improving accuracy by 9-16 absolute points. On AIME 2024, the 14B model gains 10 points with 41% compression. The explanation is that much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token.
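To make the objective concrete, the following is a minimal sketch of a per-token reverse KL computed along a single rollout. It is an illustration under stated assumptions, not the authors' implementation: the function names (`log_softmax`, `reverse_kl`, `rollout_reverse_kl`) are hypothetical, the logits are toy lists over a three-token vocabulary, and "reverse KL" is taken to mean KL(student || teacher). In the method described above, both sets of logits would come from the same model scored on tokens sampled from the student, with the teacher additionally conditioned on the "be concise" instruction.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed direction)."""
    log_p = log_softmax(student_logits)  # student: no concise instruction
    log_q = log_softmax(teacher_logits)  # teacher: same model + instruction
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def rollout_reverse_kl(student_steps, teacher_steps):
    """Mean per-token reverse KL along one student-sampled rollout."""
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)

# Toy rollout of two token positions (values are made up for illustration).
student_steps = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher_steps = [[2.0, 0.5, -1.0], [1.5, 0.2, -0.7]]

loss = rollout_reverse_kl(student_steps, teacher_steps)
print(f"mean per-token reverse KL: {loss:.4f}")
```

At the first position the two distributions agree, so that token contributes zero; only the second position, where the instructed teacher disagrees with the student, produces a training signal. In an actual training loop this quantity would be computed on model logits with an autodiff framework, and gradients would flow through the student term only, which is an assumption here since the abstract does not specify it.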