Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning
June 11, 2025
Authors: Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Linghe Kong, Lichao Sun, Weiran Huang
cs.AI
Abstract
Despite the rapid progress of multimodal large language models (MLLMs), they
have largely overlooked the importance of visual processing. In a simple yet
revealing experiment, we find, interestingly, that language-only models, when
provided with image captions, can achieve comparable or even better performance
than MLLMs that consume raw visual inputs. This suggests that current MLLMs may
generate accurate visual descriptions but fail to effectively integrate them
during reasoning. Motivated by this, we propose a simple visual perturbation
framework that enhances perceptual robustness without requiring algorithmic
modifications or additional training data. Our approach introduces three
targeted perturbations (distractor concatenation, dominance-preserving mixup,
and random rotation) that can be easily integrated into existing post-training
pipelines including SFT, DPO, and GRPO. Through extensive experiments across
multiple datasets, we demonstrate consistent improvements in mathematical
reasoning performance, with gains comparable to those achieved through
algorithmic changes. Additionally, we achieve competitive performance among
open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual
perturbation. Through comprehensive ablation studies, we analyze the
effectiveness of different perturbation strategies, revealing that each
perturbation type contributes uniquely to different aspects of visual
reasoning. Our findings highlight the critical role of visual perturbation in
multimodal mathematical reasoning: better reasoning begins with better seeing.
Our code is available at https://github.com/YutingLi0606/Vision-Matters.
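As a rough illustration of what the three perturbations could look like, here is a minimal NumPy sketch. The function names, the blend weight `alpha`, and the restriction to 90-degree rotations are assumptions for illustration, not the authors' released implementation (see the repository above for that):

```python
import numpy as np

rng = np.random.default_rng(0)

def distractor_concat(img, distractor, axis=1):
    """Concatenate a distractor image beside the target image.
    For horizontal concatenation (axis=1), heights must match."""
    return np.concatenate([img, distractor], axis=axis)

def dominance_preserving_mixup(img, distractor, alpha=0.8):
    """Blend a distractor into the image while keeping the original
    dominant (alpha > 0.5), so the problem content stays legible."""
    assert alpha > 0.5, "original image must remain dominant"
    return (alpha * img + (1 - alpha) * distractor).astype(img.dtype)

def random_rotation(img, k=None):
    """Rotate the image by a random multiple of 90 degrees
    (a simplifying assumption; arbitrary angles are also plausible)."""
    if k is None:
        k = int(rng.integers(1, 4))  # 90, 180, or 270 degrees
    return np.rot90(img, k=k)
```

Each transform perturbs only the visual input, so it can be applied to training images before any SFT, DPO, or GRPO step without touching the optimization algorithm itself.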