

Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

February 14, 2025
Authors: Wenxuan Guo, Xiuwei Xu, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
cs.AI

Abstract

In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet the requirements of real-time inference due to their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework along this technical route. However, because the 3D scene representation must interact deeply with text features in 3D visual grounding, sparse convolution-based architectures are inefficient for this interaction owing to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse the 3D scene representation with text features in an efficient way, via gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation, allowing the voxel features to interact efficiently with text features through cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively repairs over-pruned regions by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves the top inference speed and surpasses the previous fastest method by 100% in FPS. Our method also achieves state-of-the-art accuracy, even compared with two-stage methods, leading by +1.13 [email protected] on ScanRefer, and by +2.6 and +3.2 on NR3D and SR3D respectively. The code is available at https://github.com/GWxuan/TSP3D.
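To make the pruning idea concrete, below is a minimal sketch of text-guided pruning as described in the abstract, assuming PyTorch and dense tensors as stand-ins for sparse voxel features. The class name TextGuidedPruningSketch, the keep_ratio parameter, and the top-k selection rule are illustrative assumptions, not the released TSP3D implementation.

# Sketch of the text-guided pruning idea from the abstract:
# voxel features attend to text features via cross-attention, a per-voxel
# relevance score is predicted, and low-scoring voxels are pruned.
# Illustration only: class/parameter names (e.g. keep_ratio) are hypothetical
# and dense tensors stand in for sparse voxel features.
import torch
import torch.nn as nn


class TextGuidedPruningSketch(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score_head = nn.Linear(dim, 1)  # predicts per-voxel relevance to the text
        self.keep_ratio = keep_ratio

    def forward(self, voxel_feats: torch.Tensor, text_feats: torch.Tensor):
        # voxel_feats: (B, N_voxels, dim); text_feats: (B, N_tokens, dim)
        fused, _ = self.cross_attn(voxel_feats, text_feats, text_feats)
        scores = self.score_head(fused).squeeze(-1)            # (B, N_voxels)
        k = max(1, int(voxel_feats.shape[1] * self.keep_ratio))
        keep_idx = scores.topk(k, dim=1).indices               # keep most text-relevant voxels
        batch_idx = torch.arange(voxel_feats.shape[0]).unsqueeze(-1)
        return fused[batch_idx, keep_idx], keep_idx            # pruned features + kept indices


# Toy usage: one scene with 2048 voxel features and a 12-token description.
voxels = torch.randn(1, 2048, 128)
tokens = torch.randn(1, 12, 128)
pruned, kept = TextGuidedPruningSketch()(voxels, tokens)
print(pruned.shape)  # torch.Size([1, 1024, 128])

As the abstract describes, repeating such pruning across levels shrinks the voxel set before each cross-attention stage, which is what keeps the text interaction cheap; CBA would then add completed voxels back around the target to recover geometry lost to over-pruning.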

