

MIMIC: Masked Image Modeling with Image Correspondences

June 27, 2023
Authors: Kalyani Marathe, Mahtab Bigverdi, Nishat Khan, Tuhin Kundu, Aniruddha Kembhavi, Linda G. Shapiro, Ranjay Krishna
cs.AI

Abstract

Many pixelwise dense prediction tasks in computer vision today, such as depth estimation and semantic segmentation, rely on pretrained image representations. Therefore, curating effective pretraining datasets is vital. Unfortunately, effective pretraining datasets are those with multi-view scenes, and they have so far been curated only using annotated 3D meshes, point clouds, and camera parameters from simulated environments. We propose a dataset-curation mechanism that does not require any annotations. We mine two datasets from open-source video datasets and synthetic 3D environments: MIMIC-1M, with 1.3M multi-view image pairs, and MIMIC-3M, with 3.1M. We train multiple self-supervised models with different masked image modeling objectives and showcase the following findings: representations trained on MIMIC-3M outperform representations trained on data mined using annotations across multiple downstream tasks, including depth estimation, semantic segmentation, surface normals, and pose estimation. They also perform better when the representations are kept frozen and when downstream training data is limited to few-shot settings. The larger dataset (MIMIC-3M) significantly improves performance, which is promising since our curation method can arbitrarily scale to produce even larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.
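As a concrete illustration of the kind of annotation-free multi-view pair mining the abstract describes, the sketch below checks whether two video frames overlap enough to form a useful pair by matching SIFT features and fitting a RANSAC homography. This is a minimal sketch under stated assumptions, not the authors' exact pipeline: the function name and the match-count and inlier-ratio thresholds are hypothetical.

```python
# Illustrative sketch (not the MIMIC authors' exact pipeline): decide whether
# two video frames form a useful multi-view pair by matching local features
# and measuring how many correspondences survive a RANSAC homography fit.
# Thresholds are hypothetical placeholders.
import cv2
import numpy as np

def is_multiview_pair(frame_a, frame_b,
                      min_matches=50,
                      min_inlier_ratio=0.3,
                      max_inlier_ratio=0.9):
    """Return True if the two BGR frames overlap enough to act as a multi-view pair."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    kp_b, des_b = sift.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    if des_a is None or des_b is None:
        return False

    # Lowe's ratio test keeps only distinctive matches.
    matcher = cv2.BFMatcher()
    good = []
    for pair in matcher.knnMatch(des_a, des_b, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    if len(good) < max(min_matches, 4):
        return False

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])

    # RANSAC homography: the inlier ratio serves as a proxy for viewpoint overlap.
    _, mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 5.0)
    if mask is None:
        return False
    inlier_ratio = float(mask.sum()) / len(good)

    # Too few inliers -> unrelated views; too many -> near-duplicate frames.
    return min_inlier_ratio <= inlier_ratio <= max_inlier_ratio
```

In this sketch, frame pairs that pass the check would be kept as multi-view training pairs for masked image modeling, while near-duplicates and unrelated frames are discarded.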