ODIN: A Single Model for 2D and 3D Perception
January 4, 2024
Authors: Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W. Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, Katerina Fragkiadaki
cs.AI
Abstract
State-of-the-art models on contemporary 3D perception benchmarks like ScanNet
consume and label dataset-provided 3D point clouds, obtained through
post-processing of sensed multiview RGB-D images. They are typically trained
in-domain, forgo large-scale 2D pre-training, and outperform alternatives that
featurize the posed RGB-D multiview images instead. The gap in performance
between methods that consume posed images and those that consume post-processed 3D point clouds
has fueled the belief that 2D and 3D perception require distinct model
architectures. In this paper, we challenge this view and propose ODIN
(Omni-Dimensional INstance segmentation), a model that can segment and label
both 2D RGB images and 3D point clouds, using a transformer architecture that
alternates between 2D within-view and 3D cross-view information fusion. Our
model differentiates 2D and 3D feature operations through the positional
encodings of the tokens involved, which capture pixel coordinates for 2D patch
tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art
performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation
benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It
outperforms all previous works by a wide margin when the sensed 3D point cloud
is used in place of the point cloud sampled from the 3D mesh. When used as the 3D
perception engine in an instructable embodied agent architecture, it sets a new
state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and
checkpoints can be found at the project website: https://odin-seg.github.io.
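
The abstract's central mechanism is a transformer that alternates between 2D within-view attention and 3D cross-view attention over the same tokens, with the two paths distinguished only by their positional encodings (pixel coordinates vs. 3D coordinates). The sketch below is a minimal PyTorch-style illustration of that idea, not the authors' released code; the `AlternatingFusionBlock` name, the linear coordinate encoders, and all shapes are illustrative assumptions.

```python
# Minimal sketch of ODIN-style alternating 2D/3D fusion (illustrative only).
# Both attention paths share the same token features; they differ solely in
# the positional encodings added to queries/keys: pixel (u, v) coordinates
# for within-view 2D attention, world (x, y, z) coordinates for cross-view
# 3D attention.
import torch
import torch.nn as nn

class AlternatingFusionBlock(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn2d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn3d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pe2d = nn.Linear(2, dim)  # assumed encoder for pixel coordinates
        self.pe3d = nn.Linear(3, dim)  # assumed encoder for 3D coordinates

    def forward(self, feats, pix_uv, world_xyz):
        # feats:     (V, N, dim) -- V posed views, N feature tokens per view
        # pix_uv:    (V, N, 2)   -- pixel coordinates of each token
        # world_xyz: (V, N, 3)   -- 3D coordinates from depth unprojection
        V, N, D = feats.shape

        # 2D within-view fusion: each view attends only to its own tokens
        # (the view axis acts as the batch dimension).
        q = feats + self.pe2d(pix_uv)
        feats = feats + self.attn2d(q, q, feats)[0]

        # 3D cross-view fusion: all views' tokens form one joint set, now
        # positioned by their 3D coordinates instead of pixel coordinates.
        flat = feats.reshape(1, V * N, D)
        q = flat + self.pe3d(world_xyz.reshape(1, V * N, 3))
        flat = flat + self.attn3d(q, q, flat)[0]
        return flat.reshape(V, N, D)

# A single RGB image is just the V = 1 case, where cross-view attention
# degenerates to within-view attention, so one set of weights can serve
# both 2D and 3D inputs.
block = AlternatingFusionBlock()
out = block(torch.randn(4, 1024, 256),  # 4 views, 1024 tokens each
            torch.rand(4, 1024, 2),
            torch.rand(4, 1024, 3))
```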