Dense Rewards for Multi-View 3D Reasoning with Global Maps and Local Views

Abstract

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
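As a rough illustration of how dense process rewards could feed a GRPO-style update, the sketch below combines a map score, a trajectory score, and answer correctness into one scalar per rollout, then normalizes rewards within a group of rollouts for the same question. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage. Everything else is an assumption made for illustration and is not taken from the abstract: the function names, the longest-common-subsequence scoring of ordered view selection, the reference view order, and the weights `w_map`, `w_traj`, and `w_ans`.

```python
from statistics import mean, pstdev


def trajectory_reward(predicted, reference):
    """Hypothetical ordered-selection score: LCS length / reference length.

    Rewards picking the reference views in the reference order. This is an
    illustrative stand-in, not the paper's local trajectory reward.
    """
    if not reference:
        return 0.0
    m, n = len(predicted), len(reference)
    # Classic dynamic-programming table for the longest common subsequence.
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if predicted[i] == reference[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return lcs[m][n] / n


def dense_reward(map_score, traj_score, answer_correct,
                 w_map=0.3, w_traj=0.3, w_ans=0.4):
    """Weighted sum of process and outcome terms (weights are assumed)."""
    return w_map * map_score + w_traj * traj_score + w_ans * float(answer_correct)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts for one question; map_score stands in for the
# agreement between the predicted map and the pseudo target, scaled to [0, 1].
reference_views = [0, 1, 3]
rollouts = [
    {"map_score": 0.9, "views": [0, 1, 3], "correct": True},
    {"map_score": 0.6, "views": [0, 2, 3], "correct": True},
    {"map_score": 0.2, "views": [3, 1, 0], "correct": False},
]
rewards = [
    dense_reward(r["map_score"],
                 trajectory_reward(r["views"], reference_views),
                 r["correct"])
    for r in rollouts
]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a rollout that answers correctly but selects views out of order still receives a lower reward than one that also follows the reference order, which is the kind of signal answer-level supervision alone cannot provide.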