Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

July 8, 2025
Authors: Aleksandar Jevtić, Christoph Reich, Felix Wimbauer, Oliver Hahn, Christian Rupprecht, Stefan Roth, Daniel Cremers
cs.AI

Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from a single image. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing of our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single-image 3D scene understanding.