Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
July 8, 2025
Authors: Aleksandar Jevtić, Christoph Reich, Felix Wimbauer, Oliver Hahn, Christian Rupprecht, Stefan Roth, Daniel Cremers
cs.AI
Abstract
Semantic scene completion (SSC) aims to infer both the 3D geometry and
semantics of a scene from a single image. In contrast to prior work on SSC that
heavily relies on expensive ground-truth annotations, we approach SSC in an
unsupervised setting. Our novel method, SceneDINO, adapts techniques from
self-supervised representation learning and 2D unsupervised scene understanding
to SSC. Our training exclusively utilizes multi-view consistency
self-supervision without any form of semantic or geometric ground truth. Given
a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO
features in a feed-forward manner. Through a novel 3D feature distillation
approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised
scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy.
Linear probing of our 3D features matches the segmentation accuracy of a current
supervised SSC approach. Additionally, we showcase the domain generalization
and multi-view consistency of SceneDINO, taking the first steps towards a
strong foundation for single-image 3D scene understanding.
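
For readers unfamiliar with linear probing, the idea can be sketched as follows: the feature extractor stays frozen, and only a single linear classifier is trained on top of its outputs. This is a minimal illustration, not the authors' evaluation protocol. The random clustered vectors stand in for per-voxel 3D features, and the names `train_linear_probe` and `predict` are hypothetical.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=300):
    """Fit a softmax linear classifier on frozen features.

    Only W and b are updated; the features themselves are never changed,
    which is what makes this a probe rather than fine-tuning.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    """Return the highest-scoring class index for each feature vector."""
    return (features @ W + b).argmax(axis=1)

# Synthetic stand-in for frozen features (assumption: real features would
# come from a pretrained model; here they are well-separated random clusters).
rng = np.random.default_rng(0)
num_classes, dim, per_class = 3, 16, 200
centers = rng.normal(scale=4.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
features = centers[labels] + rng.normal(size=(labels.size, dim))

# Standardize each feature dimension so plain gradient descent is well behaved.
features = (features - features.mean(axis=0)) / features.std(axis=0)

W, b = train_linear_probe(features, labels, num_classes)
accuracy = (predict(features, W, b) == labels).mean()
print(f"probe accuracy on synthetic features: {accuracy:.3f}")
```

Because the probe has so little capacity, its accuracy mostly reflects how linearly separable the frozen features already are, which is why it is a common proxy for representation quality.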