Learning and Leveraging World Models in Visual Representation Learning
March 1, 2024
Authors: Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, Yann LeCun
cs.AI
Abstract
Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While JEPA has previously been limited to predicting missing parts of an input, we explore how to generalize its prediction task to a broader set of corruptions. We introduce Image World Models (IWM), an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe for learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through fine-tuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations, as with contrastive methods, or equivariant representations, as with masked image modeling.
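To make the mechanism concrete: the approach pairs a clean view and a photometrically transformed view of the same image, and trains a predictor, conditioned on the transformation parameters, to map the embedding of one view to the embedding of the other. Below is a minimal PyTorch sketch of such a training step, not the authors' implementation; the Predictor module, the iwm_step helper, the transformed-to-clean prediction direction, the EMA-style target encoder, and the L1 loss are illustrative assumptions consistent with the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    # Maps a context embedding plus a conditioning vector (parameters of the
    # applied photometric transformation, e.g. color-jitter strengths) to a
    # predicted target embedding. Hypothetical architecture for illustration.
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=-1))

def iwm_step(encoder, target_encoder, predictor, x_clean, x_aug, cond):
    # One assumed training step: embed the transformed view, then predict the
    # embedding of the clean view, conditioned on the transformation.
    with torch.no_grad():
        target = target_encoder(x_clean)  # e.g. an EMA copy of the encoder
    z = encoder(x_aug)                    # context representation
    pred = predictor(z, cond)             # undo the corruption in latent space
    return F.l1_loss(pred, target)        # regression loss; choice is assumed

# Toy usage with linear "encoders" on flattened inputs, purely illustrative.
enc, tgt = nn.Linear(32, 16), nn.Linear(32, 16)
pred_net = Predictor(dim=16, cond_dim=4)
loss = iwm_step(enc, tgt, pred_net,
                x_clean=torch.randn(8, 32),
                x_aug=torch.randn(8, 32),
                cond=torch.randn(8, 4))
loss.backward()

Conditioning is one of the three aspects the abstract identifies; intuitively, a predictor that receives cond can account for the transformation itself, leaving the encoder free to retain transformation-specific (equivariant) information, whereas dropping cond would push the encoder toward invariant representations.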