MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

Abstract

Self-supervised learning of visual representations has so far focused on content features, which identify and differentiate objects in images and videos but do not capture object motion or location. Optical flow estimation, by contrast, is a task that does not require understanding the content of the images on which it is estimated. We unify the two approaches and introduce MC-JEPA, a joint-embedding predictive architecture and self-supervised learning approach that jointly learns optical flow and content features within a shared encoder. We demonstrate that the two associated objectives, the optical flow estimation objective and the self-supervised learning objective, benefit from each other, so that the learned content features incorporate motion information. The proposed approach performs on par with existing unsupervised methods on optical flow benchmarks, and on par with common self-supervised learning approaches on downstream tasks such as semantic segmentation of images and videos.
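The idea of training one shared encoder under two objectives can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the per-pixel linear encoder, the integer horizontal flow, the two simplified losses, and the names `encode`, `flow_loss`, `content_loss`, `joint_loss`, and `ssl_weight` are all hypothetical. The actual MC-JEPA losses and network are not specified in the abstract.

```python
import numpy as np

def encode(weights, frame):
    """Hypothetical shared encoder: a per-pixel linear map followed by tanh.

    frame has shape (H, W, C); weights has shape (C, D); output is (H, W, D).
    """
    return np.tanh(frame @ weights)

def flow_loss(feat_t, feat_t1, flow):
    """Toy motion objective (an assumption, not the paper's loss).

    Shifts the features of frame t+1 back by an integer horizontal flow
    and measures how well they match the features of frame t.
    """
    warped = np.roll(feat_t1, shift=-flow, axis=1)
    return float(np.mean((warped - feat_t) ** 2))

def content_loss(feat_a, feat_b):
    """Toy content objective (an assumption, not the paper's loss).

    Spatially pooled features of two views of one image should agree.
    """
    pooled_a = feat_a.mean(axis=(0, 1))
    pooled_b = feat_b.mean(axis=(0, 1))
    return float(np.mean((pooled_a - pooled_b) ** 2))

def joint_loss(weights, frame_t, frame_t1, view_a, view_b, flow, ssl_weight=1.0):
    """Sum of both objectives, computed through the same encoder weights."""
    f_t, f_t1 = encode(weights, frame_t), encode(weights, frame_t1)
    f_a, f_b = encode(weights, view_a), encode(weights, view_b)
    return flow_loss(f_t, f_t1, flow) + ssl_weight * content_loss(f_a, f_b)
```

Because both terms depend on the same `weights`, a gradient step on `joint_loss` updates the encoder with motion and content signals at once, which is the coupling the abstract describes. The content term here has no safeguard against collapsed features, so a practical method would need an additional regularizer.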
