Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
June 17, 2026
Authors: Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han
cs.AI
Abstract
On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
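
The teacher routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Rollout` class, the `assign_teachers` function, the teacher labels, and the choice to apply the reference teacher to every token of an invalid rollout are all assumptions made for illustration, since the abstract does not specify these details.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical labels for the three teachers named in the abstract.
PERCEPTION = "perception_teacher"  # conditioned on the image only
REASONING = "reasoning_teacher"    # privileged: also conditioned on the reference target
REFERENCE = "reference_teacher"    # used only for invalid rollouts


@dataclass
class Rollout:
    """A student rollout (hypothetical structure, not from the paper)."""
    tokens: List[str]      # all generated tokens, in order
    description_len: int   # number of leading tokens in the visual description
    valid: bool            # whether the rollout follows the expected output format


def assign_teachers(rollout: Rollout) -> List[str]:
    """Return, for each token, the teacher that would supervise it."""
    n = len(rollout.tokens)
    if not rollout.valid:
        # Assumption: the reference teacher covers the whole invalid rollout.
        return [REFERENCE] * n
    # Valid rollout: description tokens go to the perception teacher,
    # reasoning and final-answer tokens go to the reasoning teacher.
    d = min(rollout.description_len, n)
    return [PERCEPTION] * d + [REASONING] * (n - d)


example = Rollout(
    tokens=["a", "red", "cube", "so", "answer", "B"],
    description_len=3,
    valid=True,
)
routing = assign_teachers(example)
```

In this sketch the reasoning teacher never supervises the description tokens, which mirrors the stated goal of keeping the perception stage dependent on the image rather than on the text reference.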