Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and the semantics of a scene from a single image. In contrast to prior work on SSC, which relies heavily on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing of our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single-image 3D scene understanding.