ODIN: A Single Model for 2D and 3D Perception

Abstract

State-of-the-art models on contemporary 3D perception benchmarks such as ScanNet consume and label dataset-provided 3D point clouds, which are obtained by post-processing sensed multiview RGB-D images. These models are typically trained in-domain, forgo large-scale 2D pre-training, and outperform alternatives that instead featurize the posed RGB-D multiview images. The performance gap between methods that consume posed images and methods that consume post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on the ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from the 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state of the art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website: https://odin-seg.github.io.
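
The alternation between 2D within-view and 3D cross-view fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention, the sinusoidal encoding built from a fixed linear projection (w2d, w3d), and the use of global rather than local attention in the 3D stage are all assumptions made for illustration. It only shows the idea that the same attention operation behaves as a 2D or a 3D operation depending on which coordinates feed its positional encoding and which tokens are grouped together.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, pos):
    # Single-head self-attention with a residual connection.
    # Positional encodings are added to queries and keys only.
    # Works on (M, C) or batched (B, M, C) inputs; attention never
    # crosses the leading batch axis.
    q = tokens + pos
    k = tokens + pos
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(tokens.shape[-1])
    return tokens + softmax(scores, axis=-1) @ tokens


def fuse_2d_then_3d(feats, pixel_xy, world_xyz, w2d, w3d):
    # feats:     (V, N, C) features for V views with N patch tokens each
    # pixel_xy:  (V, N, 2) pixel coordinates of each token
    # world_xyz: (V, N, 3) 3D coordinates of each token (assumed to come
    #            from unprojecting depth with known camera poses)
    # w2d, w3d:  hypothetical fixed projections of shape (2, C) and (3, C)
    V, N, C = feats.shape

    # 2D stage: tokens attend only within their own view,
    # and positions are encoded from pixel coordinates.
    feats = self_attention(feats, np.sin(pixel_xy @ w2d))

    # 3D stage: tokens from all views are pooled into one set,
    # and positions are encoded from 3D coordinates.
    flat = feats.reshape(V * N, C)
    flat_pos = np.sin(world_xyz.reshape(V * N, 3) @ w3d)
    flat = self_attention(flat, flat_pos)

    return flat.reshape(V, N, C)
```

Stacking several such blocks would give the alternating pattern; a real model would add learned query/key/value projections, multiple heads, and feed-forward layers.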
