Explain Before You Answer: A Survey on Compositional Visual Reasoning

August 24, 2025
Authors: Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Ranjay Krishna, Jiajun Wu, Hamid Rezatofighi
cs.AI

Abstract

Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.