

Explain Before You Answer: A Survey on Compositional Visual Reasoning

August 24, 2025
作者: Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Ranjay Krishna, Jiajun Wu, Hamid Rezatofighi
cs.AI

Abstract

Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.