

PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments

October 24, 2025
Authors: Weijie Zhou, Xuantang Xiong, Yi Peng, Manli Tao, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang
cs.AI

Abstract

Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings, limiting their effectiveness in real-world environments where information is often incomplete due to occlusion or limited field of view. Humans, in contrast, actively explore and interact with their environment (moving, examining, and manipulating objects) to gather information through a closed-loop process that integrates perception, reasoning, and action. Inspired by this human capability, we introduce the Active Visual Reasoning (AVR) task, which extends visual reasoning to partially observable, interactive environments. AVR requires agents to: (1) actively acquire information via sequential physical actions, (2) integrate observations across multiple steps into coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. To rigorously evaluate AVR, we introduce CLEVR-AVR, a simulation benchmark featuring multi-round interactive environments designed to assess both reasoning correctness and information-gathering efficiency. We also present AVR-152k, a large-scale dataset offering rich Chain-of-Thought (CoT) annotations that detail iterative reasoning for uncertainty identification, action-conditioned information-gain prediction, and information-maximizing action selection, all of which are crucial for training agents in a higher-order Markov Decision Process. Building on this, we develop PhysVLM-AVR, an MLLM achieving state-of-the-art performance on CLEVR-AVR, embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath, Geometry30K). Our analysis also reveals that current embodied MLLMs, despite detecting information incompleteness, struggle to actively acquire and integrate new information through interaction, highlighting a fundamental gap in active reasoning capabilities.
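
The abstract describes a closed-loop, higher-order decision process: at each round the agent reasons over its full observation history, checks for residual uncertainty, predicts the information gain of candidate physical actions, and executes the most informative one. One standard way to formalize the information gain of an action $a$ given history $h$ is the expected reduction in posterior entropy over the answer $Y$: $\mathrm{IG}(a) = H(Y \mid h) - \mathbb{E}_{o \sim p(o \mid h, a)}\left[H(Y \mid h, o)\right]$; whether PhysVLM-AVR uses exactly this quantity is not stated in the abstract. The Python sketch below illustrates one way such a loop could be organized; the `env`, `assess`, and `info_gain` interfaces are hypothetical placeholders for an embodied simulator and MLLM queries, not the paper's released API.

```python
"""Minimal sketch of a closed-loop Active Visual Reasoning (AVR) agent.
All interfaces here are hypothetical illustrations, not the PhysVLM-AVR API."""

from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class AVRHistory:
    """Full interaction record. Conditioning on the whole history (not just
    the latest frame) is what makes the decision process higher-order."""
    observations: List[Any] = field(default_factory=list)
    cot_trace: List[str] = field(default_factory=list)


def active_visual_reasoning(
    env: Any,  # assumed to expose observe(), available_actions(), step(action)
    assess: Callable[[AVRHistory, str], Tuple[bool, str, str]],
    info_gain: Callable[[AVRHistory, str, Any], float],
    question: str,
    max_rounds: int = 8,
) -> Tuple[str, AVRHistory]:
    """Run the perceive-reason-act loop until the answer is certain or the
    action budget runs out. `assess` returns (is_uncertain, answer, rationale);
    `info_gain` scores a candidate action given the history and the question."""
    history = AVRHistory(observations=[env.observe()])
    for _ in range(max_rounds):
        # (2) Integrate all observations gathered so far into one reasoning pass.
        uncertain, answer, rationale = assess(history, question)
        history.cot_trace.append(rationale)
        if not uncertain:
            return answer, history  # information is complete; commit to an answer
        # (1) + (3) Pick the physical action with the largest predicted
        # information gain, execute it, and fold the new view into the history.
        action = max(env.available_actions(),
                     key=lambda a: info_gain(history, question, a))
        history.observations.append(env.step(action))
    # Budget exhausted: answer as best we can under residual uncertainty.
    return assess(history, question)[1], history
```

Keeping the entire trajectory in `AVRHistory`, rather than only the latest observation, is the design choice that reflects the higher-order Markov Decision Process mentioned in the abstract: every action selection and every answer is conditioned on all past views and reasoning steps.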