

Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

June 11, 2025
Authors: Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Linghe Kong, Lichao Sun, Weiran Huang
cs.AI

Abstract
Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we find that language-only models, when provided with image captions, can match or even outperform MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to integrate them effectively during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations (distractor concatenation, dominance-preserving mixup, and random rotation) that can be easily integrated into existing post-training pipelines, including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, by training Qwen2.5-VL-7B with visual perturbation, we achieve competitive performance among open-source 7B RL-tuned models. Through comprehensive ablation studies, we analyze the effectiveness of the different perturbation strategies, revealing that each perturbation type contributes uniquely to a different aspect of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at https://github.com/YutingLi0606/Vision-Matters.
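The abstract names three perturbations but gives no implementation details. The sketch below is one plausible reading of them, assuming images are H×W×C NumPy arrays; all function names and parameters (e.g., the `alpha` mixing weight) are hypothetical illustrations, not taken from the paper's released code.

```python
import numpy as np

def distractor_concat(img, distractor, axis=1):
    # Place a distractor image side by side with the original,
    # so the model must first locate the relevant figure.
    return np.concatenate([img, distractor], axis=axis)

def dominance_preserving_mixup(img, other, alpha=0.8):
    # Blend in a second image at low weight; alpha > 0.5 keeps
    # the original image visually dominant in the mixture.
    assert alpha > 0.5, "original image must stay dominant"
    mixed = alpha * img.astype(np.float64) + (1.0 - alpha) * other.astype(np.float64)
    return np.rint(mixed).astype(img.dtype)

def random_rotation(img, rng=None):
    # Rotate the image plane by a random multiple of 90 degrees.
    rng = rng if rng is not None else np.random.default_rng()
    return np.rot90(img, k=int(rng.integers(1, 4)))
```

Applied on the fly during post-training (SFT, DPO, or GRPO), such transforms change only the visual input, which is consistent with the abstract's claim that no algorithmic modifications or extra training data are needed.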