Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Abstract

Chain-of-Thought (CoT) has become a standard method for improving the reasoning capabilities of large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness on multimodal tasks remains unclear. In this paper, we systematically investigate the key question: what can multimodal Chain-of-Thought reasoning do, and where and why does it fall short? To this end, we evaluate 14 non-reasoning models and 8 reasoning models on 12 multimodal tasks spanning perception and reasoning categories. Our analysis reveals three main findings. (1) CoT is not a free lunch and should be used selectively, depending on the specific requirements of each task. For perception tasks, CoT can cause undesirable side effects, such as reduced performance in visual grounding and object counting. In contrast, it proves effective for reasoning tasks involving mathematical, scientific, and multi-image reasoning. (2) Compared to the original models they are built on, existing open-source multimodal reasoning models often yield only marginal overall improvements, possibly because they overemphasize mathematical reasoning at the expense of broader capabilities. (3) Visual reasoning remains a key bottleneck for current multimodal CoT: models exhibit a Look Light, Think Heavy pattern in which verbal reflection rises and falls during reasoning, whereas visual reflection consistently diminishes. These findings suggest that while multimodal CoT handles verbal reflection relatively well, it lacks the ability to maintain deep visual introspection throughout the reasoning process.