Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Robust Multimodal On-Policy Self-Distillation

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy of the model to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but extending it directly to multimodal large language models (MLLMs) can create a shortcut: the privileged targets may guide tokens mainly on the basis of the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts, to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
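The teacher routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `<description>` delimiters, the validity rule, and the function names `is_valid` and `assign_teachers` are all assumptions made for illustration, since the abstract does not specify the output format. The sketch only decides which teacher would supervise each student token; the distillation loss itself is left out.

```python
from typing import List

# Hypothetical delimiters; the abstract does not specify the output format.
DESC_OPEN = "<description>"
DESC_CLOSE = "</description>"

PERCEPTION = "perception"  # image-only teacher, supervises the description
REASONING = "reasoning"    # privileged teacher, supervises reasoning and answer
REFERENCE = "reference"    # used only to recover the output format


def is_valid(tokens: List[str]) -> bool:
    """Assumed validity rule: one description block first, then non-empty reasoning."""
    if not tokens or tokens[0] != DESC_OPEN:
        return False
    if tokens.count(DESC_OPEN) != 1 or tokens.count(DESC_CLOSE) != 1:
        return False
    close = tokens.index(DESC_CLOSE)
    # At least one token must follow the closing tag.
    return close < len(tokens) - 1


def assign_teachers(tokens: List[str]) -> List[str]:
    """Return, for each student token, the teacher that supervises it."""
    if not is_valid(tokens):
        # Invalid rollouts are supervised by the reference teacher only.
        return [REFERENCE] * len(tokens)
    n_desc = tokens.index(DESC_CLOSE) + 1  # description span, tags included
    return [PERCEPTION] * n_desc + [REASONING] * (len(tokens) - n_desc)


if __name__ == "__main__":
    rollout = [DESC_OPEN, "a", "red", "cube", DESC_CLOSE, "so", "answer", "B"]
    for token, teacher in zip(rollout, assign_teachers(rollout)):
        print(f"{token:>16}  {teacher}")
```

Both teachers score the same student-generated token sequence, so the per-token labels can be used as a mask when combining their targets. How the targets are weighted or combined is not stated in the abstract and is left open here.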