Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Abstract

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make the intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo-targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
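
To make the reward structure concrete, the following is a minimal sketch of how a map reward, a trajectory reward, and an answer reward could be combined and turned into group-relative advantages of the kind GRPO uses. It is not the authors' implementation: the grid-cell map format, the longest-common-subsequence trajectory score, the weights `w_map`, `w_traj`, `w_ans`, and all function names are hypothetical choices made only for illustration.

```python
from statistics import mean, pstdev


def map_consistency_reward(pred_map, target_map):
    """Fraction of pseudo-target objects placed in the matching grid cell.

    Assumption: a map is a dict {object_name: (row, col)}.
    """
    if not target_map:
        return 0.0
    hits = sum(1 for name, cell in target_map.items()
               if pred_map.get(name) == cell)
    return hits / len(target_map)


def trajectory_reward(pred_views, ref_views):
    """Order-aware score: LCS length divided by reference length.

    Assumption: LCS is one simple stand-in for rewarding ordered selection.
    """
    if not ref_views:
        return 0.0
    m, n = len(pred_views), len(ref_views)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if pred_views[i] == ref_views[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n


def total_reward(rollout, target_map, ref_views, answer,
                 w_map=0.3, w_traj=0.3, w_ans=0.4):
    """Weighted sum of the two dense rewards and the sparse answer reward.

    The weights are illustrative placeholders, not values from the paper.
    """
    r_ans = 1.0 if rollout["answer"] == answer else 0.0
    return (w_map * map_consistency_reward(rollout["map"], target_map)
            + w_traj * trajectory_reward(rollout["views"], ref_views)
            + w_ans * r_ans)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same question.

    Uses the population standard deviation (an assumption of this sketch).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under these assumptions, a rollout that answers correctly but builds an inconsistent map or visits views in the wrong order receives a lower total reward than one that gets all three right, which is the sense in which the supervision is dense rather than answer-only.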