Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

Abstract

Despite their rapid progress, multimodal large language models (MLLMs) have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to integrate them effectively during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations (distractor concatenation, dominance-preserving mixup, and random rotation) that can be easily integrated into existing post-training pipelines, including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, we achieve competitive performance among open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual perturbation. Through comprehensive ablation studies, we analyze the effectiveness of the different perturbation strategies and find that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at https://github.com/YutingLi0606/Vision-Matters.
