Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Abstract

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures the reduction in teacher-student KL on the same context, we show that raw KL disagreement is only a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places most of its mass outside the student's current support. We formalize this local compatibility as token teachability and show that it predicts fixed-context improvement better than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies the OPD loss only at high-teachability positions and requires no reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD while retaining only 5% of tokens, and it improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
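
To make the distinction between learnable and incompatible disagreement concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not give the exact teachability formula, so the score used here (KL multiplied by the teacher mass that falls on the student's top-K tokens) is an assumption, as are the KL direction, the function names, and the toy distributions.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary. The direction is an assumption."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def topk_teacher_mass(teacher, student, k):
    """Teacher probability mass that lands on the student's top-k tokens."""
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in top)


def teachability(teacher, student, k):
    """Hypothetical score: disagreement weighted by on-support teacher mass."""
    return kl(teacher, student) * topk_teacher_mass(teacher, student, k)


def select_positions(teacher_probs, student_probs, k=2, keep_frac=0.05):
    """Return the indices of the highest-scoring positions and all scores."""
    scores = [teachability(t, s, k)
              for t, s in zip(teacher_probs, student_probs)]
    n_keep = max(1, math.ceil(keep_frac * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores


# Toy next-token distributions over a 4-token vocabulary, one row per position.
student_probs = [
    [0.70, 0.20, 0.05, 0.05],  # position 0: teacher agrees with the student
    [0.60, 0.30, 0.05, 0.05],  # position 1: teacher prefers the student's 2nd choice
    [0.60, 0.30, 0.05, 0.05],  # position 2: teacher prefers an off-support token
]
teacher_probs = [
    [0.70, 0.20, 0.05, 0.05],
    [0.20, 0.70, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.80],
]

raw_kl = [kl(t, s) for t, s in zip(teacher_probs, student_probs)]
kept, scores = select_positions(teacher_probs, student_probs, k=2, keep_frac=0.05)
print("raw KL per position:", [round(v, 3) for v in raw_kl])
print("teachability per position:", [round(v, 3) for v in scores])
print("positions kept:", kept)
```

In this toy case raw KL ranks position 2 highest, because the teacher puts most of its mass on a token the student barely considers, while the assumed score ranks position 1 highest, because there the teacher's mass falls on the student's own top-2 candidates. In an OPD training loop, the distillation loss would then be masked to the kept positions.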