Learning and Leveraging World Models in Visual Representation Learning
March 1, 2024
Authors: Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, Yann LeCun
cs.AI
Abstract
Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While JEPA has previously been limited to predicting missing parts of an input, we explore how to generalize its prediction task to a broader set of corruptions. We introduce Image World Models (IWM), an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe for learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through fine-tuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations, as with contrastive methods, or equivariant representations, as with masked image modeling.
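To make the mechanism concrete: the approach pairs a clean view and a photometrically transformed view of the same image, and trains a predictor, conditioned on the transformation parameters, to map the embedding of one view to the embedding of the other. Below is a minimal PyTorch sketch of such a training step, not the authors' implementation; the Predictor module, the iwm_step helper, the transformed-to-clean prediction direction, the EMA-style target encoder, and the L1 loss are illustrative assumptions consistent with the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    # Maps a context embedding plus a conditioning vector (parameters of the
    # applied photometric transformation, e.g. color-jitter strengths) to a
    # predicted target embedding. Hypothetical architecture for illustration.
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=-1))

def iwm_step(encoder, target_encoder, predictor, x_clean, x_aug, cond):
    # One assumed training step: embed the transformed view, then predict the
    # embedding of the clean view, conditioned on the transformation.
    with torch.no_grad():
        target = target_encoder(x_clean)  # e.g. an EMA copy of the encoder
    z = encoder(x_aug)                    # context representation
    pred = predictor(z, cond)             # undo the corruption in latent space
    return F.l1_loss(pred, target)        # regression loss; choice is assumed

# Toy usage with linear "encoders" on flattened inputs, purely illustrative.
enc, tgt = nn.Linear(32, 16), nn.Linear(32, 16)
pred_net = Predictor(dim=16, cond_dim=4)
loss = iwm_step(enc, tgt, pred_net,
                x_clean=torch.randn(8, 32),
                x_aug=torch.randn(8, 32),
                cond=torch.randn(8, 4))
loss.backward()

Conditioning is one of the three aspects the abstract identifies; intuitively, a predictor that receives cond can account for the transformation itself, leaving the encoder free to retain transformation-specific (equivariant) information, whereas dropping cond would push the encoder toward invariant representations.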