Learning and Leveraging World Models in Visual Representation Learning

Abstract

Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. Whereas JEPA was previously limited to predicting missing parts of an input, we explore how to generalize its prediction task to a broader set of corruptions. We introduce Image World Models (IWM), an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe for learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through fine-tuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, yielding either invariant representations, as in contrastive methods, or equivariant representations, as in masked image modeling.
