PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments
October 24, 2025
Authors: Weijie Zhou, Xuantang Xiong, Yi Peng, Manli Tao, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang
cs.AI
Abstract
Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings, limiting their effectiveness in real-world environments where information is often incomplete due to occlusion or limited field of view. Humans, in contrast, actively explore and interact with their environment (moving, examining, and manipulating objects) to gather information through a closed-loop process that integrates perception, reasoning, and action. Inspired by this human capability, we introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. AVR requires agents to: (1) actively acquire information via sequential physical actions, (2) integrate observations across multiple steps for coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. To rigorously evaluate AVR, we introduce CLEVR-AVR, a simulation benchmark featuring multi-round interactive environments designed to assess both reasoning correctness and information-gathering efficiency. We present AVR-152k, a large-scale dataset offering rich Chain-of-Thought (CoT) annotations that detail the iterative reasoning process of uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection, which is crucial for training agents in a higher-order Markov Decision Process. Building on this, we develop PhysVLM-AVR, an MLLM achieving state-of-the-art performance on CLEVR-AVR, embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath, Geometry30K). Our analysis also reveals that current embodied MLLMs, despite detecting information incompleteness, struggle to actively acquire and integrate new information through interaction, highlighting a fundamental gap in active reasoning capabilities.
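
To make the iterative reasoning described in the abstract concrete, below is a minimal, illustrative Python sketch of an active visual reasoning loop (uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection). All parameter names here (observe, act, identify_uncertainty, info_gain, answer) are hypothetical placeholders, not the paper's API; PhysVLM-AVR learns this behavior as an MLLM policy rather than following a hand-written loop.

```python
from typing import Any, Callable, Dict, List


def avr_loop(
    observe: Callable[[], Any],                                # returns the current observation
    act: Callable[[str], Any],                                 # executes an action, returns the new observation
    identify_uncertainty: Callable[[str, List[Any]], bool],    # is more information still needed?
    info_gain: Callable[[str, str, List[Any]], float],         # predicted information gain of an action
    answer: Callable[[str, List[Any]], str],                   # final answer from the observation history
    question: str,
    candidate_actions: List[str],
    max_rounds: int = 5,
) -> str:
    """Sketch of an active visual reasoning loop: act to gather information, then answer."""
    history: List[Any] = [observe()]                           # observations are integrated across steps
    for _ in range(max_rounds):
        if not identify_uncertainty(question, history):
            break                                              # no remaining uncertainty: stop exploring
        # Predict action-conditioned information gain and select the maximizing action.
        gains: Dict[str, float] = {a: info_gain(a, question, history) for a in candidate_actions}
        best_action = max(gains, key=gains.get)
        history.append(act(best_action))                       # interact with the environment, observe the outcome
    return answer(question, history)
```

In the benchmark setting described above, such a loop would be judged both on the correctness of the final answer and on the efficiency of the information-gathering actions it takes.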