MIMIC: Masked Image Modeling with Image Correspondences
June 27, 2023
Authors: Kalyani Marathe, Mahtab Bigverdi, Nishat Khan, Tuhin Kundu, Aniruddha Kembhavi, Linda G. Shapiro, Ranjay Krishna
cs.AI
Abstract
Many pixelwise dense prediction tasks in computer vision today, such as depth
estimation and semantic segmentation, rely on pretrained image representations.
Therefore, curating effective pretraining datasets is vital. Unfortunately,
effective pretraining datasets are those with multi-view scenes, and they have
so far been curated only using annotated 3D meshes, point clouds, and camera
parameters from simulated environments.
from simulated environments. We propose a dataset-curation mechanism that does
not require any annotations. We mine two datasets: MIMIC-1M with 1.3M and
MIMIC-3M with 3.1M multi-view image pairs from open-sourced video datasets and
from synthetic 3D environments. We train multiple self-supervised models with
different masked image modeling objectives to showcase the following findings:
Representations trained on MIMIC-3M outperform those mined using annotations on
multiple downstream tasks, including depth estimation, semantic segmentation,
surface normals, and pose estimation. They also outperform when the
representations are frozen and when downstream training data is limited to
few-shot settings. The larger dataset (MIMIC-3M) significantly improves
performance, which is promising since
our curation method can arbitrarily scale to produce even larger datasets.
MIMIC code, dataset, and pretrained models are open-sourced at
https://github.com/RAIVNLab/MIMIC.
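The masked image modeling objectives mentioned above all build on the same primitive: split an image into non-overlapping patches and hide a random subset, training the model to reconstruct the hidden patches from the visible ones. Below is a minimal sketch of that masking step in NumPy. This is generic MAE-style masking for illustration, not MIMIC's exact objective, and `random_patch_mask` is a hypothetical helper, not part of the released codebase.

```python
import numpy as np

def random_patch_mask(image, patch=16, mask_ratio=0.75, seed=0):
    """Split an image into non-overlapping patches and mask a random
    subset (generic MAE-style masking, shown for illustration only).

    Returns the flattened patch matrix, the indices of visible patches
    (the encoder's input), and the indices of masked patches (the
    reconstruction targets)."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Rearrange (H, W, C) into a (num_patches, patch*patch*C) matrix.
    patches = (image[:gh * patch, :gw * patch]
               .reshape(gh, patch, gw, patch, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(gh * gw, -1))
    # Shuffle patch indices and split into masked / visible sets.
    rng = np.random.default_rng(seed)
    order = rng.permutation(gh * gw)
    n_masked = int(mask_ratio * gh * gw)
    masked_idx, visible_idx = order[:n_masked], order[n_masked:]
    return patches, visible_idx, masked_idx

# A 224x224 RGB image with 16x16 patches yields 14*14 = 196 patches of
# dimension 16*16*3 = 768; at a 0.75 mask ratio, 147 are hidden.
img = np.random.default_rng(0).random((224, 224, 3))
patches, visible, masked = random_patch_mask(img)
print(patches.shape, len(visible), len(masked))  # (196, 768) 49 147
```

In a multi-view setup such as the one the abstract describes, the same masking would be applied to one image of a pair while the corresponding view provides an additional signal for reconstruction.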