From Sight to Thought: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning, yet we find that their performance on visual tasks is limited primarily by insufficient visual perception rather than by reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing model capabilities into three separate training stages (visual perception, visual reasoning, and textual reasoning), each with specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before visual reasoning is refined; and (c) is learned more effectively via reinforcement learning (RL) than via caption-based supervised fine-tuning (SFT). Experiments across multiple VLMs show that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging is a new curriculum dimension orthogonal to traditional difficulty-based curricula, and that combining the two yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, improving over the base counterpart on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).
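The contrast between staged and merged training can be illustrated with a minimal data-scheduling sketch. This is not the authors' pipeline: the function names, the toy example pools, and the placement of textual reasoning as the last stage are assumptions made for illustration (the abstract only states that perception should be solidified before visual reasoning is refined).

```python
import random
from typing import Dict, List, Sequence

Example = Dict[str, str]


def staged_schedule(pools: Dict[str, List[Example]],
                    stage_order: Sequence[str]) -> List[Example]:
    """Concatenate capability pools so each stage finishes before the next begins."""
    schedule: List[Example] = []
    for stage in stage_order:
        schedule.extend(pools[stage])
    return schedule


def merged_schedule(pools: Dict[str, List[Example]], seed: int = 0) -> List[Example]:
    """Mix every capability pool into one shuffled training stream."""
    mixed = [ex for pool in pools.values() for ex in pool]
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical toy pools; real data would hold images, prompts, and rewards.
capabilities = ("visual_perception", "visual_reasoning", "textual_reasoning")
pools = {
    name: [{"capability": name, "id": f"{name}-{i}"} for i in range(3)]
    for name in capabilities
}

# Assumed stage order: perception first, then visual reasoning, then textual reasoning.
staged = staged_schedule(pools, capabilities)
merged = merged_schedule(pools)

print([ex["capability"] for ex in staged])
print([ex["capability"] for ex in merged])
```

Both schedules contain exactly the same examples; only the ordering differs, which is the variable the staged-versus-merged comparison isolates.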