TIP: Token Importance in On-Policy Distillation

Abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy but high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses the second region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on fewer than 20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository (https://github.com/HJSang/OPSD_OnPolicyDistillation), which supports memory-efficient distillation of larger models under limited GPU budgets.
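The two regions described in the abstract can be illustrated with a small, self-contained sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the function names, the thresholds `h_max` and `d_min`, the toy distributions, the choice of KL(student || teacher) as the divergence, and the use of a deterministic top-k cut as a stand-in for the entropy-based sampling the abstract mentions.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)

def top_fraction_by_entropy(student, keep=0.5):
    """Indices of the `keep` fraction of positions with the highest student entropy.

    Assumption: a deterministic top-k cut, used here in place of sampling.
    """
    h = [entropy(p) for p in student]
    k = max(1, round(keep * len(h)))
    order = sorted(range(len(h)), key=lambda i: h[i], reverse=True)
    return sorted(order[:k])

def overconfident_positions(student, teacher, h_max, d_min):
    """Indices with low student entropy and high student-to-teacher divergence.

    Assumption: h_max and d_min are hypothetical thresholds chosen by hand.
    """
    return [
        i
        for i, (p, q) in enumerate(zip(student, teacher))
        if entropy(p) <= h_max and kl(p, q) >= d_min
    ]

# Toy per-position next-token distributions over a 4-token vocabulary.
student = [
    [0.25, 0.25, 0.25, 0.25],  # uncertain student
    [0.97, 0.01, 0.01, 0.01],  # confident student, teacher disagrees
    [0.97, 0.01, 0.01, 0.01],  # confident student, teacher agrees
    [0.40, 0.40, 0.10, 0.10],  # fairly uncertain student
]
teacher = [
    [0.70, 0.10, 0.10, 0.10],
    [0.01, 0.97, 0.01, 0.01],
    [0.97, 0.01, 0.01, 0.01],
    [0.50, 0.30, 0.10, 0.10],
]

high_entropy = top_fraction_by_entropy(student, keep=0.5)
overconfident = overconfident_positions(student, teacher, h_max=0.5, d_min=1.0)
selected = sorted(set(high_entropy) | set(overconfident))

print(high_entropy)   # [0, 3]
print(overconfident)  # [1]
print(selected)       # [0, 1, 3]
```

In this toy case the entropy rule alone never selects position 1, because the student is confident there, yet that is the position where the teacher disagrees most strongly. Taking the union of the two masks is one simple way to combine uncertainty and disagreement; the selection rules actually used in the paper may differ.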
