Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

Abstract

Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable, off-the-shelf video representations by predicting masked regions in latent space with a teacher updated by exponential moving average (EMA). While EMA prevents representation collapse, it complicates scalable model selection and couples the teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a target encoder with a simple pixel-reconstruction objective under V-JEPA masking, then (ii) freeze it and train a student to predict the teacher's latents on masked regions. The result is a two-stage, unregularized scheme that we call SALT (Static-teacher Asymmetric Latent Training). SALT decouples optimization into pixel reconstruction (teacher) and masked latent prediction (student), which increases transparency, efficiency, and scalability while preserving the representation's ability to generalize under frozen evaluation. Empirically, our student models outperform the recently proposed V-JEPA 2 encoders under frozen-backbone evaluation across diverse benchmarks. They are also more compute-optimal: at matched pretraining FLOPs, our method achieves higher probing accuracy, and its scaling curves dominate V-JEPA's accuracy-FLOPs Pareto frontier. Finally, we find that student quality is remarkably robust to teacher quality: high-performing students emerge even from small, sub-optimal teachers. This suggests that the compute budget should be allocated overwhelmingly to the student. These results position SALT as a simple, scalable, and compute-efficient alternative to EMA-based self-distillation for video representation learning.
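
The two-stage recipe described above can be sketched on toy data. The NumPy example below is only an illustration under strong assumptions, not the authors' implementation: the encoders are two-layer linear maps rather than video transformers, masked tokens are zero-filled, both losses are squared errors, the teacher targets are computed from the unmasked clip, and every name and dimension (`masked_mse`, `fit_two_layer`, `K`, `H`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video" data: B clips, each with N patch tokens of D pixel values.
# Clips are generated from a low-dimensional source so that masked
# patches are at least partly predictable from the visible ones.
B, N, D = 256, 4, 3
K, H = 2, 4  # per-token widths: teacher latent (K), student hidden (H)
source = rng.normal(size=(B, 3))
clips = source @ rng.normal(size=(3, N * D))   # (B, N*D), tokens flattened
clips = (clips - clips.mean(0)) / clips.std(0)

# Token-level mask (True = masked): two masked tokens per clip.
token_mask = np.zeros((B, N), dtype=bool)
for b in range(B):
    token_mask[b, rng.choice(N, size=2, replace=False)] = True
pixel_mask = np.repeat(token_mask, D, axis=1)   # (B, N*D)
latent_mask = np.repeat(token_mask, K, axis=1)  # (B, N*K)
visible = clips * ~pixel_mask                   # masked tokens zero-filled


def masked_mse(pred, target, mask):
    """Squared error averaged over masked positions, plus dLoss/dpred."""
    diff = (pred - target) * mask
    n = mask.sum()
    return 0.5 * (diff ** 2).sum() / n, diff / n


def fit_two_layer(x, target, mask, w1, w2, steps=500, lr=0.05):
    """Full-batch gradient descent on x @ w1 @ w2; updates w1, w2 in place."""
    losses = []
    for _ in range(steps):
        hidden = x @ w1
        loss, grad_out = masked_mse(hidden @ w2, target, mask)
        grad_w2 = hidden.T @ grad_out
        grad_w1 = x.T @ (grad_out @ w2.T)
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
        losses.append(loss)
    return losses


# Stage 1: train the teacher to reconstruct masked pixels from visible ones.
teacher_enc = rng.normal(scale=0.1, size=(N * D, N * K))
teacher_dec = rng.normal(scale=0.1, size=(N * K, N * D))
teacher_losses = fit_two_layer(visible, clips, pixel_mask,
                               teacher_enc, teacher_dec)

# Stage 2: freeze the teacher. Its latents are computed once (assumed here
# to come from the unmasked clip) and never updated again.
targets = clips @ teacher_enc                   # (B, N*K), fixed targets

# The student (encoder + predictor) sees only visible tokens and is trained
# to match the frozen teacher's latents at the masked token positions.
student_enc = rng.normal(scale=0.1, size=(N * D, N * H))
student_pred = rng.normal(scale=0.1, size=(N * H, N * K))
student_losses = fit_two_layer(visible, targets, latent_mask,
                               student_enc, student_pred)

print("teacher pixel loss:", teacher_losses[0], "->", teacher_losses[-1])
print("student latent loss:", student_losses[0], "->", student_losses[-1])
```

Because the teacher never changes in stage 2, its targets are computed once and reused, and the student's width (`H`) is chosen independently of the teacher's (`K`). Neither holds when the teacher is an EMA copy of the student.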
