Dense Rewards for Multi-View 3D Reasoning with Global Maps and Local Views

Abstract

Multi-view 3D visual question answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. Current multimodal large language models (LLMs), however, are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that supervises the reasoning process with dense, verifiable rewards. Our approach decomposes MV3D-VQA into three stages: (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make the intermediate stages learnable without manual annotation, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo-targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
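
To make the reward structure concrete, the following is a minimal Python sketch of how a dense reward of this kind could be assembled and turned into group-relative advantages. It is an illustration under stated assumptions, not the formulation used in DR-MV3D: the abstract does not give the reward definitions, so the grid-cell map representation, the longest-common-subsequence trajectory score, the weights `w_map`, `w_traj`, `w_ans`, and all function names are hypothetical. Only the final step, normalizing each rollout's reward by the mean and standard deviation of its group, follows the standard GRPO advantage computation.

```python
from statistics import mean, pstdev


def global_consistency_reward(pred_map, pseudo_map):
    """Hypothetical map reward: fraction of pseudo-target objects whose
    predicted top-down grid cell matches. Maps are dicts: name -> (row, col)."""
    if not pseudo_map:
        return 0.0
    hits = sum(1 for name, cell in pseudo_map.items() if pred_map.get(name) == cell)
    return hits / len(pseudo_map)


def local_trajectory_reward(pred_views, ref_views):
    """Hypothetical order-aware reward: length of the longest common
    subsequence of the two view-index lists, divided by len(ref_views)."""
    if not ref_views:
        return 0.0
    m, n = len(pred_views), len(ref_views)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if pred_views[i] == ref_views[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n


def dense_reward(rollout, target, w_map=0.3, w_traj=0.3, w_ans=0.4):
    """Weighted sum of map, trajectory, and answer rewards.
    The weights are illustrative assumptions, not values from the paper."""
    r_map = global_consistency_reward(rollout["map"], target["map"])
    r_traj = local_trajectory_reward(rollout["views"], target["views"])
    r_ans = 1.0 if rollout["answer"] == target["answer"] else 0.0
    return w_map * r_map + w_traj * r_traj + w_ans * r_ans


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of
    rollouts sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch a rollout that places objects correctly and visits the reference views in order still earns partial credit when its final answer is wrong, which is the sense in which the supervision is dense rather than answer-level only.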