Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-supervised Learning in 3D Scene Understanding

Abstract

Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically used only as a weight initialization step for task-specific fine-tuning, which limits their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments demonstrate that our method not only achieves performance competitive with supervised models but also surpasses existing self-supervised approaches by a large margin. The model and training code are available in our GitHub repository (https://github.com/phermosilla/msm).
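
The multi-resolution feature sampling used by the evaluation protocol can be illustrated with a minimal sketch. It assumes that each level of the hierarchical model exposes its cell centers and per-cell features as arrays, and that a point simply takes the feature of its nearest cell at every level. The function names, the nearest-cell lookup, and the cosine-similarity classifier are illustrative assumptions and are not taken from the released code.

```python
import numpy as np


def sample_level_features(points, level_coords, level_feats):
    """Assign each point the feature of its nearest cell center at one level.

    Assumption: nearest-cell lookup; an actual protocol may interpolate instead.
    """
    # Pairwise squared distances, shape (num_points, num_cells).
    diff = points[:, None, :] - level_coords[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    return level_feats[nearest]


def multi_resolution_features(points, levels):
    """Concatenate per-point features sampled from every hierarchy level.

    `levels` is a list of (cell_coords, cell_feats) pairs, fine to coarse.
    """
    per_level = [sample_level_features(points, c, f) for c, f in levels]
    return np.concatenate(per_level, axis=1)


def nearest_neighbor_predict(train_feats, train_labels, test_feats):
    """1-nearest-neighbor labels using cosine similarity between features."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    # Row i of (b @ a.T) holds the similarity of test point i to all training points.
    return train_labels[(b @ a.T).argmax(axis=1)]
```

Because the backbone stays frozen, the concatenated point-level features can be passed directly to a linear classifier or to the nearest-neighbor routine above, so the score reflects the quality of the features rather than any fine-tuning.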