OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Abstract

On-Policy Distillation (OPD) trains a student model on its own self-generated trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, which excludes a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle: it depends on a narrow overlap of plausible next tokens between teacher and student, and it is prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative improvement on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
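The abstract does not specify how the peak-entropy scheduler or the Dirichlet-Multinomial prior is implemented, so the following is only an illustrative sketch of the two general ideas, not the authors' method. It assumes that audit points are simply the k positions with the highest next-token entropy, and that Monte Carlo rollouts yield discrete counts over candidate chunks that are smoothed with a symmetric Dirichlet prior. The function names, the top-k selection rule, and the parameter `alpha` are all hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def peak_entropy_positions(entropies, k):
    """Indices of the k highest-entropy positions, returned in sequence order.

    Assumption: "peak-entropy" is approximated here by a plain top-k rule.
    """
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean of category probabilities under a symmetric Dirichlet(alpha) prior.

    With few rollouts, this keeps the estimate away from exactly 0 or 1.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Toy student distributions at four token positions (hypothetical values).
student_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
entropies = [token_entropy(p) for p in student_probs]
audit_positions = peak_entropy_positions(entropies, k=2)

# Toy rollout counts over three candidate chunks at one audited position.
smoothed = dirichlet_posterior_mean([3, 1, 0], alpha=1.0)
```

In this toy example, the two audited positions are the uniform distribution and the flattest non-uniform one, and the candidate chunk that received zero rollout votes still keeps a nonzero smoothed probability (1/7), which is the variance-limiting effect a Bayesian prior is generally used for.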