Learning and Leveraging World Models in Visual Representation Learning

Abstract

Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While JEPA has previously been limited to predicting missing parts of an input, we explore how to generalize its prediction task to a broader set of corruptions. We introduce Image World Models (IWM), an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe for learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through fine-tuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning either invariant representations, as contrastive methods do, or equivariant representations, as masked image modeling does.
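
As a rough illustration of what predicting the effect of a global photometric transformation in latent space can mean, the toy sketch below may help. It is not the method of the paper: the encoder and predictor here are hand-written stand-ins (hypothetical names `encode`, `predict`, `photometric`) rather than learned networks, the transformation is assumed to be a simple brightness shift plus contrast scale, and the direction of prediction (from the original view to the transformed view) is also an assumption. It only shows why conditioning the predictor on the transformation parameters matters.

```python
import math

def photometric(img, brightness, contrast):
    """Assumed global transformation: contrast scale around 0.5 plus a brightness shift."""
    return [contrast * (p - 0.5) + 0.5 + brightness for p in img]

def encode(img):
    """Toy stand-in for an encoder: maps an image to a 2-d latent (mean, std)."""
    mean = sum(img) / len(img)
    var = sum((p - mean) ** 2 for p in img) / len(img)
    return [mean, math.sqrt(var)]

def predict(latent, brightness, contrast):
    """Toy predictor conditioned on the transformation parameters.

    It applies the effect of the transformation directly to the latent,
    without ever seeing the transformed pixels.
    """
    mean, std = latent
    return [contrast * (mean - 0.5) + 0.5 + brightness, abs(contrast) * std]

def latent_loss(pred, target):
    """Squared L2 distance between predicted and target latents."""
    return sum((a - b) ** 2 for a, b in zip(pred, target))

source = [0.1, 0.4, 0.5, 0.9]   # a tiny "image" of four pixel intensities
params = (0.05, 1.2)            # (brightness, contrast), chosen arbitrarily

# Target latent: encode the transformed image.
target_latent = encode(photometric(source, *params))

# Conditioned prediction: knows which transformation was applied.
conditioned_loss = latent_loss(predict(encode(source), *params), target_latent)

# Unconditioned baseline: reuses the source latent unchanged.
unconditioned_loss = latent_loss(encode(source), target_latent)

print(f"conditioned loss:   {conditioned_loss:.6f}")
print(f"unconditioned loss: {unconditioned_loss:.6f}")
```

In this toy setting the conditioned predictor matches the target latent up to floating-point error, while the unconditioned baseline cannot, because the latent here changes with the transformation (an equivariant latent). A latent that discarded brightness and contrast altogether would make both losses zero, which corresponds to the invariant case.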
