Explain Before You Answer: A Survey on Compositional Visual Reasoning

Abstract

Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.