Visual Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Abstract

Chain-of-Thought (CoT) reasoning improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). Recent work moves beyond text-based multimodal reasoning toward interleaved-modal reasoning, in which intermediate steps can incorporate both textual rationales and visual evidence. In this work, we pursue a bolder and more ambitious idea: can images alone serve as the reasoning medium for both language and multimodal tasks? To explore this question, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate the concept in two variants: typographic-based optical reasoning, which optimizes visual layout to render rationales compactly, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can encode rationales effectively and efficiently while providing a unified visual canvas for reasoning.
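To make the reported reduction figures concrete, the following is a minimal sketch of how a percentage reduction in reasoning tokens could be computed from a text baseline. The function name and the token counts are hypothetical and chosen only for illustration; they are not taken from the paper. The 1.96 times token-efficiency figure is not reproduced here, since the abstract does not define that metric.

```python
def token_reduction_pct(text_tokens: int, optical_tokens: int) -> float:
    """Percentage of reasoning tokens saved relative to a text-reasoning baseline.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return 100.0 * (text_tokens - optical_tokens) / text_tokens


# Illustrative (made-up) token counts for a single rationale.
language_example = token_reduction_pct(text_tokens=700, optical_tokens=500)
multimodal_example = token_reduction_pct(text_tokens=100, optical_tokens=84)

print(f"language task reduction:   {language_example:.2f}%")
print(f"multimodal task reduction: {multimodal_example:.2f}%")
```

Under these assumed counts, a rationale that shrinks from 700 to 500 tokens corresponds to a reduction of about 28.57%, and one that shrinks from 100 to 84 tokens corresponds to 16%. The paper's figures are averages over benchmarks, so individual examples may differ.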