MIMIC: Masked Image Modeling with Image Correspondences

Abstract

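Many pixel-wise dense prediction tasks, such as depth estimation and semantic segmentation, currently rely on pretrained image representations. Curating effective pretraining datasets is therefore critical. However, effective pretraining datasets contain multi-view scenes and have so far been curated only with annotated 3D meshes, point clouds, and camera parameters obtained from simulated environments. This paper proposes a dataset-curation mechanism that requires no annotations at all. From open-source video datasets and synthetic 3D environments, we built two datasets: MIMIC-1M, containing 1.3 million multi-view image pairs, and MIMIC-3M, containing 3.1 million multi-view image pairs. We train multiple self-supervised models with different masked image modeling objectives and report the following findings. Representations trained on MIMIC-3M outperform representations built using annotations on multiple downstream tasks, including depth estimation, semantic segmentation, surface normals, and pose estimation. They also perform better when the representations are frozen and when downstream training data is limited to few-shot. The larger dataset (MIMIC-3M) improves performance substantially, which is promising because our curation method can scale arbitrarily to produce even larger datasets. MIMIC code, datasets, and pretrained models are available at https://github.com/RAIVNLab/MIMIC.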

English

Many pixelwise dense prediction tasks in computer vision today, such as depth estimation and semantic segmentation, rely on pretrained image representations. Therefore, curating effective pretraining datasets is vital. Unfortunately, effective pretraining datasets are those with multi-view scenes, and such datasets have so far been curated only using annotated 3D meshes, point clouds, and camera parameters from simulated environments. We propose a dataset-curation mechanism that does not require any annotations. We mine two datasets from open-sourced video datasets and from synthetic 3D environments: MIMIC-1M with 1.3M multi-view image pairs and MIMIC-3M with 3.1M multi-view image pairs. We train multiple self-supervised models with different masked image modeling objectives to showcase the following findings. Representations trained on MIMIC-3M outperform those trained on datasets mined using annotations on multiple downstream tasks, including depth estimation, semantic segmentation, surface normals, and pose estimation. They also outperform when the representations are frozen and when downstream training data is limited to few-shot. The larger dataset (MIMIC-3M) significantly improves performance, which is promising since our curation method can scale arbitrarily to produce even larger datasets. MIMIC code, datasets, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.
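
As a rough illustration of what a masked image modeling objective over a multi-view image pair can look like, the sketch below builds patch masks for a (reference, target) pair. It is not taken from the MIMIC code. The function names, the patch size of 16, the mask ratio of 0.75, and the choice to leave the reference view fully visible while masking only the target view are all assumptions made for illustration.

```python
import random


def patch_grid(height, width, patch):
    """Return (rows, cols) of non-overlapping patches covering the image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return height // patch, width // patch


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans, True where a patch is hidden from the model."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def mask_pair(height, width, patch=16, mask_ratio=0.75, seed=0):
    """Build patch masks for a (reference, target) view pair.

    Assumption for illustration only: the reference view stays fully
    visible and only the target view is masked, so a model would have to
    use the overlapping reference view to reconstruct hidden target patches.
    """
    rng = random.Random(seed)
    rows, cols = patch_grid(height, width, patch)
    num_patches = rows * cols
    reference_mask = [False] * num_patches
    target_mask = random_patch_mask(num_patches, mask_ratio, rng)
    return reference_mask, target_mask


if __name__ == "__main__":
    ref_mask, tgt_mask = mask_pair(224, 224)
    print(len(tgt_mask), "patches,", sum(tgt_mask), "masked in the target view")
```

A training loop would then compute a reconstruction loss only over the target patches where the mask is True. The actual objectives and masking settings used by the authors are in the linked repository.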
