See Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Robust Multimodal On-Policy Self-Distillation

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
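The teacher routing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `<description>` and `</description>` markers, the validity rule, the choice to count the markers as part of the description span, and the function name `route_teachers` are all assumptions made for illustration, since the abstract does not specify the output format. The sketch only decides which teacher would supervise each token of a student rollout; computing the token-level targets themselves is out of scope.

```python
from typing import List

# Hypothetical segment markers; the abstract does not specify the real format.
DESC_OPEN = "<description>"
DESC_CLOSE = "</description>"


def route_teachers(tokens: List[str]) -> List[str]:
    """Return one teacher label per token of a student rollout.

    Assumed validity rule: the rollout opens with DESC_OPEN, contains
    DESC_CLOSE, and has at least one token after the description.
    """
    try:
        start = tokens.index(DESC_OPEN)
        end = tokens.index(DESC_CLOSE)
    except ValueError:
        start = end = -1  # a marker is missing, so the rollout is invalid

    valid = start == 0 and start < end < len(tokens) - 1
    if not valid:
        # Invalid rollout: only the reference teacher is used,
        # with the goal of recovering the output format.
        return ["reference"] * len(tokens)

    # Valid rollout: the image-only perception teacher covers the description
    # (markers included, by assumption); the privileged reasoning teacher
    # covers the reasoning and final answer that follow the same prefix.
    n_desc = end + 1
    return ["perception"] * n_desc + ["reasoning"] * (len(tokens) - n_desc)


if __name__ == "__main__":
    rollout = [DESC_OPEN, "a", "red", "cube", DESC_CLOSE, "so", "the", "answer", "is", "A"]
    for token, teacher in zip(rollout, route_teachers(rollout)):
        print(f"{teacher:>10}  {token}")
```

In a real training loop, each label would presumably select which frozen-teacher distribution the student is distilled toward at that position. Because both teachers score the same student-generated prefix, the split changes only what each teacher is conditioned on, not the rollout itself.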