Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Abstract

Multimodal Reasoning Models (MRMs) that rely on Chain-of-Thought (CoT) reasoning have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
