Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing of our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single-image 3D scene understanding.