A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding
August 2, 2025
Authors: Zhan Shi, Song Wang, Junbo Chen, Jianke Zhu
cs.AI
Abstract
Visual grounding aims to identify objects or regions in a scene based on
natural language descriptions, essential for spatially aware perception in
autonomous driving. However, existing visual grounding tasks typically depend
on bounding boxes that often fail to capture fine-grained details. Not all
voxels within a bounding box are occupied, resulting in inaccurate object
representations. To address this, we introduce a benchmark for 3D occupancy
grounding in challenging outdoor scenes. Built on the nuScenes dataset, it
integrates natural language with voxel-level occupancy annotations, offering
more precise object perception than conventional bounding-box grounding.
Moreover, we propose GroundingOcc, an end-to-end model designed for 3D
occupancy grounding through multi-modal learning. It combines visual, textual,
and point cloud features to predict object location and occupancy information
from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder
for feature extraction, an occupancy head for voxel-wise predictions, and a
grounding head to refine localization. Additionally, a 2D grounding module and
a depth estimation module enhance geometric understanding, thereby boosting
model performance. Extensive experiments on the benchmark demonstrate that our
method outperforms existing baselines on 3D occupancy grounding. The dataset is
available at https://github.com/RONINGOD/GroundingOcc.
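
The abstract outlines a coarse-to-fine pipeline built from a multimodal encoder, a grounding head for box-level localization, an occupancy head for voxel-wise prediction, and auxiliary 2D grounding and depth estimation modules. The sketch below is only an illustration of how such components could be wired together; it is not the paper's implementation, and every class name, interface, feature dimension, and voxel-grid size here is a hypothetical assumption.

```python
# Minimal PyTorch-style sketch of a coarse-to-fine occupancy-grounding pipeline.
# All names, shapes, and design choices are illustrative assumptions, not GroundingOcc itself.
import torch
import torch.nn as nn


class GroundingOccSketch(nn.Module):
    """Illustrative skeleton: multimodal encoder -> grounding head (coarse) ->
    occupancy head (fine), plus auxiliary 2D grounding and depth branches."""

    def __init__(self, feat_dim: int = 256, voxel_grid: tuple = (50, 50, 4)):
        super().__init__()
        self.voxel_grid = voxel_grid
        # Placeholder per-modality encoders (image tokens, text tokens, LiDAR point features).
        self.image_encoder = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.point_encoder = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        # Cross-attention fusion: text queries attend to visual and point features.
        self.fusion = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # Coarse stage: regress a 3D box for the referred object.
        self.grounding_head = nn.Linear(feat_dim, 7)        # (x, y, z, w, l, h, yaw)
        # Fine stage: score each voxel as occupied by the referred object.
        num_voxels = voxel_grid[0] * voxel_grid[1] * voxel_grid[2]
        self.occupancy_head = nn.Linear(feat_dim, num_voxels)
        # Auxiliary branches mentioned in the abstract.
        self.grounding_2d_head = nn.Linear(feat_dim, 4)     # 2D box on the image plane
        self.depth_head = nn.Linear(feat_dim, 1)            # per-query depth estimate

    def forward(self, image_feat, text_feat, point_feat):
        # Encode each modality, then fuse with cross-attention.
        vis = torch.cat(
            [self.image_encoder(image_feat), self.point_encoder(point_feat)], dim=1
        )
        txt = self.text_encoder(text_feat)
        fused, _ = self.fusion(txt, vis, vis)
        query = fused.mean(dim=1)                           # one pooled query per sample
        return {
            "box_3d": self.grounding_head(query),           # coarse localization
            "occupancy": self.occupancy_head(query).view(-1, *self.voxel_grid),
            "box_2d": self.grounding_2d_head(query),        # auxiliary 2D grounding
            "depth": self.depth_head(query),                # auxiliary depth estimate
        }


if __name__ == "__main__":
    # Toy usage with random feature tensors (batch of 2).
    model = GroundingOccSketch()
    out = model(torch.randn(2, 100, 64), torch.randn(2, 12, 64), torch.randn(2, 200, 64))
    print({k: tuple(v.shape) for k, v in out.items()})
```

In this reading of the abstract, the grounding head supplies the coarse localization and the occupancy head refines it to a voxel-level mask, while the 2D grounding and depth branches serve as auxiliary supervision for geometric understanding; the actual architecture and training details are described in the paper and repository linked above.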