Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Abstract

Recent progress in multimodal reasoning has been driven largely by textual Chain-of-Thought (CoT), a paradigm in which models reason within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, using vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a paradigm shift from models that merely think about images to models that can truly think with images. In this emerging paradigm, models use visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution along a trajectory of increasing cognitive autonomy that unfolds across three key stages: external tool exploration, programmatic manipulation, and intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the thinking-with-images paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. With this structured overview, we aim to offer a clear roadmap for future research toward more powerful and human-aligned multimodal AI.