Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
June 17, 2026
Authors: Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han
cs.AI
Abstract
On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
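The teacher routing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the output format or the validity check, so the `<description>` tags, the function names, and the rule "a rollout is valid if it opens with a closed description followed by at least one more token" are all assumptions made for illustration. The sketch only labels which teacher would supervise each student token; it does not compute any distillation loss.

```python
from typing import List, Optional

# Hypothetical delimiters: the abstract does not state how the description is marked.
DESC_OPEN = "<description>"
DESC_CLOSE = "</description>"


def split_index(tokens: List[str]) -> Optional[int]:
    """Return the index of the first token after the description block.

    Returns None when the rollout does not follow the assumed format
    (description first, then reasoning and a final answer).
    """
    if not tokens or tokens[0] != DESC_OPEN:
        return None
    try:
        close = tokens.index(DESC_CLOSE)
    except ValueError:
        return None  # description was never closed
    if close == len(tokens) - 1:
        return None  # nothing follows the description
    return close + 1


def route_teachers(tokens: List[str]) -> List[str]:
    """Label each student token with the teacher that would supervise it.

    Valid rollout:   description tokens -> "perception" (image-only teacher),
                     remaining tokens   -> "reasoning" (privileged teacher).
    Invalid rollout: every token        -> "reference" (format recovery).
    """
    boundary = split_index(tokens)
    if boundary is None:
        return ["reference"] * len(tokens)
    return ["perception"] * boundary + ["reasoning"] * (len(tokens) - boundary)


valid = [DESC_OPEN, "a", "red", "cube", DESC_CLOSE, "so", "the", "answer", "is", "A"]
invalid = ["the", "answer", "is", "A"]
print(route_teachers(valid))
print(route_teachers(invalid))
```

In this sketch both teachers for a valid rollout would score the same student-generated prefix; they differ only in what they are conditioned on (image only versus image plus reference target), which is the decoupling the abstract describes.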