Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

Abstract

Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable, off-the-shelf video representations by predicting masked regions in latent space, using a teacher updated by an exponential moving average (EMA). While EMA prevents representation collapse, it complicates scalable model selection and couples the teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a target encoder with a simple pixel-reconstruction objective under V-JEPA masking, then (ii) freeze it and train a student to predict the teacher's latents on masked regions. This leads to a two-stage, unregularized scheme that we refer to as SALT (Static-teacher Asymmetric Latent Training). SALT decouples optimization into pixel reconstruction (teacher) and masked latent prediction (student), increasing transparency, efficiency, and scalability while preserving the representations' ability to generalize under frozen evaluation. Empirically, our student models outperform the recently proposed V-JEPA 2 encoders under frozen-backbone evaluation across diverse benchmarks. They are also more compute-optimal: at matched pretraining FLOPs, our method achieves higher probing accuracy, and its scaling curves dominate V-JEPA's accuracy-FLOPs Pareto frontier. Finally, we find that student quality is remarkably robust to teacher quality: high-performing students emerge even with small, sub-optimal teachers. This suggests that the compute budget should be allocated overwhelmingly to the student. These results position SALT as a simple, scalable, and compute-efficient alternative to EMA-based self-distillation for video representation learning.
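
The two-stage recipe in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract does not specify architectures or hyperparameters, so everything here is an assumption made for illustration. Linear maps stand in for the encoders, a "clip" is a set of noisy copies of one static frame, the context is the mean of the visible tokens' encodings, and the names `make_toy_clips` and `train_masked_predictor` are hypothetical. The point is only the structure: stage 1 predicts pixels at masked positions, stage 2 freezes that encoder and trains a separate, differently sized student to predict the frozen teacher's latents at masked positions.

```python
import numpy as np


def make_toy_clips(n=256, t=8, d=16, k=4, noise=0.1, seed=0):
    """Toy 'video': each clip is t noisy copies of one low-rank frame (assumption)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, k))                      # per-clip content code
    basis = rng.normal(size=(k, d)) / np.sqrt(k)     # shared frame basis
    clean = z @ basis                                # (n, d)
    return clean[:, None, :] + noise * rng.normal(size=(n, t, d))


def train_masked_predictor(x, targets, hidden, steps=400, lr=0.01,
                           mask_ratio=0.5, seed=0):
    """Train a linear encoder plus per-position linear heads to predict
    `targets` at masked token positions from the visible tokens only."""
    rng = np.random.default_rng(seed)
    n, t, d = x.shape
    out = targets.shape[-1]
    enc = rng.normal(0.0, 0.1, (d, hidden))          # encoder weights
    head = rng.normal(0.0, 0.1, (t, hidden, out))    # one head per position
    n_mask = int(mask_ratio * t)
    losses = []
    for _ in range(steps):
        perm = rng.permutation(t)                    # fresh random mask each step
        masked, visible = perm[:n_mask], perm[n_mask:]
        ctx_in = x[:, visible].mean(axis=1)          # (n, d) mean visible token
        ctx = ctx_in @ enc                           # (n, hidden) context code
        pred = np.einsum("nh,mho->nmo", ctx, head[masked])
        err = pred - targets[:, masked]              # (n, n_mask, out)
        losses.append(float(np.mean(err ** 2)))      # reported MSE
        # Gradient of 0.5 * squared error, summed over masked tokens and
        # output dimensions, averaged over clips.
        g = err / n
        g_head = np.einsum("nh,nmo->mho", ctx, g)
        g_ctx = np.einsum("nmo,mho->nh", g, head[masked])
        enc -= lr * (ctx_in.T @ g_ctx)
        head[masked] -= lr * g_head
    return enc, losses


x = make_toy_clips()

# Stage 1 (teacher): pixel reconstruction at masked positions.
teacher_enc, teacher_losses = train_masked_predictor(x, targets=x, hidden=6, seed=1)
teacher_enc.setflags(write=False)                    # freeze the teacher

# Stage 2 (student): predict the frozen teacher's latents at masked positions.
# The student is wider than the teacher; the two architectures are independent.
teacher_latents = x @ teacher_enc                    # (n, t, 6), fixed targets
student_enc, student_losses = train_masked_predictor(
    x, targets=teacher_latents, hidden=12, seed=2)

print(f"teacher pixel loss:  {teacher_losses[0]:.3f} -> {teacher_losses[-1]:.3f}")
print(f"student latent loss: {student_losses[0]:.3f} -> {student_losses[-1]:.3f}")
```

Because the stage-2 targets are fixed, a constant student output cannot drive the loss to zero, which is why this sketch needs no EMA update or extra regularizer. Both stages share one training routine and differ only in their targets, which mirrors the decoupling the abstract describes.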
