Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

Abstract

Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically used only as a weight-initialization step for task-specific fine-tuning, which limits their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and are therefore suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments demonstrate that our method not only achieves performance competitive with supervised models but also surpasses existing self-supervised approaches by a large margin. The model and training code can be found in our GitHub repository (https://github.com/phermosilla/msm).
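
The evaluation idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that each level of a hierarchical model exposes point coordinates and per-point features, that features are sampled per input point by nearest-neighbor lookup (the abstract does not specify the sampling scheme), and that the linear probe is fit in closed form with ridge regression to one-hot targets rather than trained with gradient descent. All function names are hypothetical.

```python
import numpy as np


def nearest_index(query, reference):
    """Return, for every query point, the index of its closest reference point."""
    # Pairwise squared distances, shape (num_query, num_reference).
    d2 = ((query[:, None, :] - reference[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def multi_resolution_features(points, levels):
    """Build one feature vector per input point from all hierarchy levels.

    levels is a list of (coords, feats) pairs, one per resolution.
    ASSUMPTION: nearest-neighbor sampling stands in for whatever
    interpolation the actual protocol uses.
    """
    sampled = [feats[nearest_index(points, coords)] for coords, feats in levels]
    return np.concatenate(sampled, axis=1)


def add_bias(features):
    """Append a constant column so the linear map can learn an offset."""
    return np.hstack([features, np.ones((len(features), 1))])


def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a linear classifier on frozen features.

    ASSUMPTION: ridge regression to one-hot targets replaces the usual
    cross-entropy training, to keep the sketch dependency-free.
    """
    x = add_bias(features)
    y = np.eye(num_classes)[labels]
    return np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ y)


def predict_linear_probe(weights, features):
    """Predict the class with the highest linear score."""
    return (add_bias(features) @ weights).argmax(axis=1)


def predict_nearest_neighbor(train_feats, train_labels, test_feats):
    """Label each test point with the label of its closest training feature."""
    return train_labels[nearest_index(test_feats, train_feats)]
```

In both evaluations the backbone stays frozen: only the linear weights are fit, or no parameters at all in the nearest-neighbor case, so the scores reflect the quality of the off-the-shelf features rather than the effect of fine-tuning.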
