

Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

June 11, 2025
Authors: Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Linghe Kong, Lichao Sun, Weiran Huang
cs.AI

Abstract

Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we find that language-only models, when provided with image captions, can match or even exceed the performance of MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to integrate them effectively during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations (distractor concatenation, dominance-preserving mixup, and random rotation) that can be easily integrated into existing post-training pipelines, including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, by training Qwen2.5-VL-7B with visual perturbation, we achieve competitive performance among open-source 7B RL-tuned models. Through comprehensive ablation studies, we analyze the effectiveness of the different perturbation strategies, revealing that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at https://github.com/YutingLi0606/Vision-Matters.
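The abstract names the three perturbations but gives no implementation details here; the sketch below is a minimal illustration of what they might look like applied to H×W×C image arrays. The function names, the `alpha` threshold, and the choice of quarter-turn rotations are assumptions for illustration, not the paper's actual implementation (see the linked repository for that).

```python
import numpy as np

def distractor_concatenation(image, distractor, axis=1):
    """Place an unrelated distractor image alongside the original,
    so the model must locate the relevant visual content."""
    return np.concatenate([image, distractor], axis=axis)

def dominance_preserving_mixup(image, other, alpha=0.8):
    """Blend in a second image while keeping the original dominant;
    alpha > 0.5 ensures the target image's content still prevails."""
    assert alpha > 0.5, "original image must remain dominant"
    return alpha * image + (1.0 - alpha) * other

def random_rotation(image, rng=None):
    """Rotate the image by a random multiple of 90 degrees
    (quarter turns avoid interpolation artifacts)."""
    rng = rng if rng is not None else np.random.default_rng()
    k = int(rng.integers(1, 4))  # 1, 2, or 3 quarter turns
    return np.rot90(image, k=k, axes=(0, 1))
```

Perturbations of this kind operate purely on the input images, which is why they can be dropped into SFT, DPO, or GRPO pipelines without touching the training algorithm itself.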