
Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

June 22, 2026
Authors: Jiho Choi, Seonho Lee, Seojeong Park, Hyunjung Shim
cs.AI

Abstract

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
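
The abstract does not give the reward formulas, so the following is only a minimal sketch of how dense reward terms might feed a GRPO-style update. It assumes each rollout already has a global consistency score and a local trajectory score in [0, 1], plus an answer-correctness term that is an assumption added here. The names `RolloutRewards`, `total_reward`, and `group_relative_advantages`, the weights, and the plain weighted sum are all hypothetical and not taken from the paper. The only standard piece is the group-relative advantage, which standardizes rewards across rollouts sampled for the same question.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RolloutRewards:
    """Per-rollout reward terms. All field names are hypothetical."""
    global_consistency: float  # assumed score in [0, 1]: predicted map vs. pseudo target
    local_trajectory: float    # assumed score in [0, 1]: ordered view selection
    answer_correct: float      # assumed term: 1.0 if the final answer is right, else 0.0


def total_reward(r, w_global=1.0, w_local=1.0, w_answer=1.0):
    """Weighted sum of the reward terms. The weights are illustrative assumptions."""
    return (w_global * r.global_consistency
            + w_local * r.local_trajectory
            + w_answer * r.answer_correct)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(x - mu) / (sigma + eps) for x in rewards]


# Three hypothetical rollouts sampled for the same question.
group = [
    RolloutRewards(0.9, 0.8, 1.0),
    RolloutRewards(0.2, 0.1, 0.0),
    RolloutRewards(0.5, 0.5, 1.0),
]
rewards = [total_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Under this sketch, a rollout with a correct answer but a poorly aligned map or trajectory still receives a lower advantage than one that gets all three right, which is the intuition behind process-level supervision described above.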