Learning and Leveraging World Models in Visual Representation Learning
March 1, 2024
Authors: Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, Yann LeCun
cs.AI
Abstract
Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising
self-supervised approach that learns by leveraging a world model. While JEPA
was previously limited to predicting missing parts of an input, we explore how
to generalize its prediction task to a broader set of corruptions. We
introduce Image World Models (IWM), an approach that goes beyond masked image
modeling and learns to predict the effect of global photometric transformations
in latent space. We study the recipe for learning performant IWMs and show that
it relies on three key aspects: conditioning, prediction difficulty, and
capacity. Additionally, we show that the predictive world model learned by IWM
can be adapted through finetuning to solve diverse tasks; a finetuned IWM
world model matches or surpasses the performance of previous self-supervised
methods. Finally, we show that learning with an IWM allows one to control the
abstraction level of the learned representations, learning either invariant
representations, as contrastive methods do, or equivariant representations,
as masked image modeling does.
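
To make "predicting the effect of global photometric transformations in latent space" concrete, here is a minimal toy sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the linear encoder, the brightness shift as the only transformation, the linear predictor, the hand-built predictor weights, and all function names. It omits training entirely. It shows only why conditioning the predictor on the transformation parameters matters: for this toy encoder, a conditioned predictor can match the target latent exactly, while the same predictor without the parameters cannot.

```python
import numpy as np

def adjust_brightness(image, delta):
    """Global photometric transformation (assumed): shift every pixel by delta."""
    return np.clip(image + delta, 0.0, 1.0)

def encode(image, weights):
    """Toy linear encoder standing in for a real vision backbone (assumption)."""
    return image.reshape(-1) @ weights

def predict(z_source, params, predictor):
    """Toy linear predictor conditioned on the transformation parameters."""
    return np.concatenate([z_source, params]) @ predictor

def latent_prediction_loss(image, delta, weights, predictor, conditioned=True):
    """Mean squared error between predicted and actual latent of the transformed image."""
    z_source = encode(image, weights)
    z_target = encode(adjust_brightness(image, delta), weights)
    # Without conditioning, the predictor is told nothing about the transformation.
    params = np.array([delta if conditioned else 0.0])
    z_pred = predict(z_source, params, predictor)
    return float(np.mean((z_pred - z_target) ** 2))

rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.6, size=(8, 8))   # pixel range chosen so clipping never triggers
weights = rng.normal(size=(64, 4))           # 64 pixels -> 4-dimensional latent

# Hand-built predictor: with a linear encoder, a brightness shift of delta moves
# the latent by delta * weights.sum(axis=0), so the exact predictor is
# identity on the latent plus that direction scaled by delta.
predictor = np.vstack([np.eye(4), weights.sum(axis=0, keepdims=True)])

loss_conditioned = latent_prediction_loss(image, 0.3, weights, predictor)
loss_unconditioned = latent_prediction_loss(image, 0.3, weights, predictor, conditioned=False)
print(loss_conditioned, loss_unconditioned)
```

In a learned setting the predictor would be trained rather than constructed by hand, and the encoder would be nonlinear; the sketch isolates only the role of conditioning, the first of the three aspects the abstract names.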