
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

December 18, 2025
作者: Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, Dong Yu
cs.AI

Abstract

While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception, enabling it to localize objects directly in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training of these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning for vision-language models.
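The abstract describes the 2D-to-3D lifting step only at a high level. As a rough, hypothetical illustration of the general idea, the sketch below back-projects a 2D box annotation into 3D using an estimated depth map and camera intrinsics. The function name `lift_2d_box_to_3d`, the percentile-based depth trimming, and the axis-aligned output box are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def lift_2d_box_to_3d(box_2d, depth, K):
    """Back-project a 2D box annotation to an axis-aligned 3D box.

    box_2d : (x_min, y_min, x_max, y_max) in pixel coordinates.
    depth  : (H, W) metric depth map, e.g. from a monocular depth estimator.
    K      : (3, 3) camera intrinsics matrix.
    Returns (center_xyz, size_xyz) in camera coordinates.
    """
    x0, y0, x1, y1 = [int(v) for v in box_2d]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Gather per-pixel depths inside the 2D box and back-project
    # each pixel through the pinhole model.
    ys, xs = np.mgrid[y0:y1, x0:x1]
    z = depth[ys, xs]
    valid = z > 0                       # drop missing / invalid depth
    z = z[valid]
    x = (xs[valid] - cx) * z / fx
    y = (ys[valid] - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1)  # (N, 3) points in camera frame

    # Trim depth outliers: a loose 2D box usually includes background
    # pixels far behind the object.
    z_lo, z_hi = np.percentile(pts[:, 2], [5, 95])
    pts = pts[(pts[:, 2] >= z_lo) & (pts[:, 2] <= z_hi)]

    # Axis-aligned 3D box around the surviving point cloud.
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return (lo + hi) / 2.0, hi - lo
```

The percentile trim is one simple way to keep background depth from inflating the lifted box; a real pipeline of this kind would also need per-image intrinsics and a depth estimator calibrated to metric scale.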