Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that relies heavily on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single-image 3D scene understanding.
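As a rough illustration of what linear probing means in this context, the sketch below fits a softmax linear classifier on top of frozen feature vectors, so that only the linear layer is trained. It is not the evaluation protocol used for SceneDINO: the features are synthetic Gaussian clusters standing in for per-point 3D features, and the function names, dimensions, and hyperparameters (`train_linear_probe`, `lr`, `steps`) are assumptions chosen for the example.

```python
import numpy as np


def train_linear_probe(features, labels, num_classes, lr=0.1, steps=300):
    """Fit a softmax linear classifier on frozen features.

    Only W and b are updated; the features themselves are never changed,
    which is what distinguishes a linear probe from fine-tuning.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(features, W, b):
    """Return the highest-scoring class index for each feature vector."""
    return np.argmax(features @ W + b, axis=1)


# Synthetic stand-in for frozen per-point features (assumption, not real data).
rng = np.random.default_rng(0)
num_classes, dim, per_class = 3, 8, 200
centers = rng.normal(scale=4.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
features = centers[labels] + rng.normal(size=(labels.size, dim))

# Standardize each feature dimension so a fixed step size behaves well.
features = (features - features.mean(axis=0)) / features.std(axis=0)

W, b = train_linear_probe(features, labels, num_classes)
accuracy = float((predict(features, W, b) == labels).mean())
print(f"probe accuracy on synthetic features: {accuracy:.3f}")
```

Because the probe has only a weight matrix and a bias, its accuracy mainly reflects how linearly separable the frozen features already are, which is why linear probing is commonly used to assess representation quality.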