MIMIC: Masked Image Modeling with Image Correspondences

Abstract

Many pixelwise dense prediction tasks in computer vision today, such as depth estimation and semantic segmentation, rely on pretrained image representations. Therefore, curating effective pretraining datasets is vital. Unfortunately, the effective pretraining datasets are those with multi-view scenes, and they have only been curated using annotated 3D meshes, point clouds, and camera parameters from simulated environments. We propose a dataset-curation mechanism that does not require any annotations. We mine two datasets, MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs, from open-sourced video datasets and from synthetic 3D environments. We train multiple self-supervised models with different masked image modeling objectives to showcase the following findings. Representations trained on MIMIC-3M outperform those trained on datasets mined using annotations on multiple downstream tasks, including depth estimation, semantic segmentation, surface normals, and pose estimation. They also outperform when the representations are frozen and when downstream training data is limited to few-shot. The larger dataset (MIMIC-3M) significantly improves performance, which is promising since our curation method can arbitrarily scale to produce even larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.
