N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

December 18, 2025
Authors: Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, Dong Yu
cs.AI

Abstract

While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception, enabling it to localize objects directly in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning for vision-language models.
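
The abstract describes lifting 2D annotations into 3D via depth estimation. As an illustration only (not the authors' actual pipeline; the function and parameter names below are hypothetical), a minimal sketch of back-projecting a 2D box annotation into camera-space 3D using a predicted depth map and pinhole intrinsics might look like:

```python
# Illustrative sketch, not the paper's pipeline: lift a 2D box annotation
# into a rough 3D location using an estimated depth map and camera intrinsics.
import numpy as np

def lift_2d_box_to_3d(box_2d, depth_map, fx, fy, cx, cy):
    """Back-project the pixels inside a 2D box into camera-space 3D points.

    box_2d    : (x_min, y_min, x_max, y_max) in pixel coordinates
    depth_map : HxW array of metric depth (e.g. from a monocular estimator)
    fx, fy    : focal lengths in pixels; cx, cy : principal point
    Returns the 3D centroid and axis-aligned extent of the box region.
    """
    x0, y0, x1, y1 = [int(v) for v in box_2d]
    ys, xs = np.mgrid[y0:y1, x0:x1]            # pixel grid inside the box
    z = depth_map[y0:y1, x0:x1]                # estimated depth per pixel
    valid = z > 0                              # drop invalid depth values
    x = (xs[valid] - cx) * z[valid] / fx       # pinhole back-projection
    y = (ys[valid] - cy) * z[valid] / fy
    pts = np.stack([x, y, z[valid]], axis=-1)  # N x 3 camera-space points
    center = pts.mean(axis=0)
    extent = pts.max(axis=0) - pts.min(axis=0)
    return center, extent
```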