
A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding

August 2, 2025
Authors: Zhan Shi, Song Wang, Junbo Chen, Jianke Zhu
cs.AI

Abstract

Visual grounding aims to identify objects or regions in a scene based on natural language descriptions, which is essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes that often fail to capture fine-grained details: not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations, offering more precise object perception than the traditional bounding-box grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning. It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multi-modal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head to refine localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at https://github.com/RONINGOD/GroundingOcc.
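
To make the coarse-to-fine pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of its structure: a multi-modal encoder that fuses image, text, and point cloud features, an occupancy head for voxel-wise prediction, and a grounding head for refined localization. The class name GroundingOccSketch, all feature dimensions, the 200×200×16 voxel grid, and the 7-parameter box output are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the coarse-to-fine pipeline described in the abstract.
# All module names, dimensions, and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class GroundingOccSketch(nn.Module):
    """Hypothetical skeleton: multi-modal encoder -> occupancy head -> grounding head."""

    def __init__(self, feat_dim=256, voxel_grid=(200, 200, 16)):
        super().__init__()
        self.voxel_grid = voxel_grid
        # Multi-modal encoder: project image, text, and point-cloud features
        # into a shared space (linear layers stand in for real backbones).
        self.img_proj = nn.Linear(512, feat_dim)
        self.txt_proj = nn.Linear(768, feat_dim)
        self.pts_proj = nn.Linear(64, feat_dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Occupancy head: coarse voxel-wise occupancy logits for the referred object.
        self.occ_head = nn.Linear(feat_dim, voxel_grid[0] * voxel_grid[1] * voxel_grid[2])
        # Grounding head: refined localization, e.g. a 3D box (center, size, yaw).
        self.grounding_head = nn.Linear(feat_dim, 7)

    def forward(self, img_feat, txt_feat, pts_feat):
        # Fuse the three modalities as a short token sequence.
        tokens = torch.stack(
            [self.img_proj(img_feat), self.txt_proj(txt_feat), self.pts_proj(pts_feat)],
            dim=1,
        )
        fused = self.fuse(tokens).mean(dim=1)
        # Coarse stage: voxel-wise occupancy prediction.
        occ_logits = self.occ_head(fused).view(-1, *self.voxel_grid)
        # Fine stage: refined localization of the referred object.
        box = self.grounding_head(fused)
        return occ_logits, box


# Toy usage with random pooled per-modality features for a batch of 2.
model = GroundingOccSketch()
occ, box = model(torch.randn(2, 512), torch.randn(2, 768), torch.randn(2, 64))
print(occ.shape, box.shape)  # torch.Size([2, 200, 200, 16]) torch.Size([2, 7])
```

In the paper's full model, the auxiliary 2D grounding and depth estimation modules would supply additional supervision for geometric understanding; they are omitted here for brevity.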