
ODIN: A Single Model for 2D and 3D Perception

January 4, 2024
作者: Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W. Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, Katerina Fragkiadaki
cs.AI

Abstract

State-of-the-art models on contemporary 3D perception benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post-processing of sensed multiview RGB-D images. They are typically trained in-domain, forgo large-scale 2D pre-training, and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on the ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from the 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website: https://odin-seg.github.io.
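The core architectural idea, a transformer that alternates between within-view 2D attention and cross-view 3D attention, with the two modes distinguished only by the positional encodings of the tokens, can be illustrated compactly. Below is a minimal, hypothetical PyTorch sketch (all class and function names are illustrative assumptions, not the authors' released code): patch tokens first carry Fourier features of their pixel coordinates while attending within each view, then the same tokens, re-encoded with Fourier features of their depth-unprojected 3D coordinates, attend across all views as a single point set.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of ODIN's alternating 2D/3D fusion; names and details
# are illustrative assumptions, not the authors' released implementation.

class FourierPosEnc(nn.Module):
    """Sinusoidal features of raw coordinates, projected to the model width.

    The same module encodes 2D pixel coordinates (in_dim=2) or 3D world
    coordinates (in_dim=3); per the abstract, these positional encodings
    are what distinguish the 2D and 3D attention stages below.
    """
    def __init__(self, in_dim: int, d_model: int, n_freqs: int = 16):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs))
        self.proj = nn.Linear(in_dim * 2 * n_freqs, d_model)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        angles = coords.unsqueeze(-1) * self.freqs           # (..., c, F)
        enc = torch.cat([angles.sin(), angles.cos()], -1)    # (..., c, 2F)
        return self.proj(enc.flatten(-2))                    # (..., d_model)

class AlternatingFusionBlock(nn.Module):
    """One within-view 2D attention pass followed by one cross-view 3D pass."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.pe_2d = FourierPosEnc(2, d_model)
        self.pe_3d = FourierPosEnc(3, d_model)
        self.attn_2d = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_3d = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_2d = nn.LayerNorm(d_model)
        self.norm_3d = nn.LayerNorm(d_model)

    def forward(self, feats, pix_coords, world_coords):
        # feats: (V, N, D) patch tokens for V posed views, N tokens per view.
        V, N, D = feats.shape
        # 2D stage: each view is its own attention batch (within-view),
        # queries/keys carry 2D pixel-coordinate encodings.
        q = feats + self.pe_2d(pix_coords)
        feats = self.norm_2d(feats + self.attn_2d(q, q, feats)[0])
        # 3D stage: all views flattened into one point set (cross-view),
        # tokens now carry encodings of their unprojected 3D coordinates.
        q = (feats + self.pe_3d(world_coords)).reshape(1, V * N, D)
        kv = feats.reshape(1, V * N, D)
        feats = self.norm_3d(kv + self.attn_3d(q, q, kv)[0])
        return feats.reshape(V, N, D)

# Toy usage: 4 posed RGB-D views, 196 patch tokens each, model width 256.
feats = torch.randn(4, 196, 256)
pix = torch.rand(4, 196, 2)    # normalized (u, v) pixel coordinates
xyz = torch.rand(4, 196, 3)    # world (x, y, z) from depth + camera pose
out = AlternatingFusionBlock(256)(feats, pix, xyz)   # -> (4, 196, 256)
```

Because both stages share the token representation and only the positional encodings change, the same attention weights can in principle be pre-trained on 2D images and reused for 3D reasoning, which is how the paper argues against treating the 2D/3D gap as architectural.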