Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers
September 29, 2025
Authors: Xianhang Li, Chen Huang, Chun-Liang Li, Eran Malach, Josh Susskind, Vimal Thilak, Etai Littwin
cs.AI
Abstract
Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable
off-the-shelf video representations by predicting masked regions in latent space
with an exponential moving average (EMA)-updated teacher. While EMA prevents
representation collapse, it complicates scalable model selection and couples
teacher and student architectures. We revisit masked-latent prediction and show
that a frozen teacher suffices. Concretely, we (i) train a target encoder with
a simple pixel-reconstruction objective under V-JEPA masking, then (ii) freeze
it and train a student to predict the teacher's latents on masked regions. This
leads to a two-stage, unregularized scheme that we refer to as SALT
(Static-teacher Asymmetric Latent Training). SALT decouples optimization into
pixel reconstruction (teacher) and masked latent prediction (student),
increasing transparency, efficiency, and scalability while preserving the
representations' ability to generalize under frozen evaluation. Empirically,
our student models outperform recently proposed V-JEPA 2 encoders under frozen
backbone evaluation across diverse benchmarks. They are also more
compute-efficient: at matched pretraining FLOPs, our method achieves higher
probing accuracy, and its scaling curves dominate V-JEPA's accuracy-FLOPs
Pareto frontier. Finally, we find that student quality is remarkably robust to
teacher quality: high-performing students emerge even with small, sub-optimal
teachers. This suggests that the compute budget should overwhelmingly favor
the student. These results position SALT as a simple, scalable, and
compute-efficient alternative to EMA-based self-distillation for video
representation learning.
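
To make the two-stage recipe concrete, below is a minimal PyTorch sketch of the SALT training loop as the abstract describes it. This is an illustration under assumptions, not the paper's implementation: the tiny transformer encoders, the uniform random masking (V-JEPA uses structured spatiotemporal masks), and the names `teacher`, `decoder`, `student`, and `predictor` are all hypothetical placeholders for the paper's ViT encoders and heads.

```python
# Minimal sketch of the two-stage SALT scheme: (1) train a teacher with
# pixel reconstruction under masking, (2) freeze it and train a student
# to predict the teacher's latents at masked positions. All sizes, names,
# and the masking strategy are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_PIX, D_LAT, N_TOK = 768, 256, 128  # toy pixel-token dim, latent dim, tokens per clip

class MaskedEncoder(nn.Module):
    """Embeds pixel tokens, replaces masked positions with a learned mask
    token, and lets attention fill them in from the visible context."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(D_PIX, D_LAT)
        self.mask_token = nn.Parameter(torch.zeros(D_LAT))
        layer = nn.TransformerEncoderLayer(d_model=D_LAT, nhead=4,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens, masked_idx=None):
        z = self.embed(tokens)
        if masked_idx is not None:
            z[:, masked_idx] = self.mask_token
        return self.encoder(z)

def random_mask(n_tok, ratio=0.75):
    """Uniform random masking as a stand-in for V-JEPA's structured masks."""
    return torch.randperm(n_tok)[:int(n_tok * ratio)]

teacher = MaskedEncoder()
decoder = nn.Linear(D_LAT, D_PIX)    # pixel-reconstruction head, stage 1 only
student = MaskedEncoder()
predictor = nn.Linear(D_LAT, D_LAT)  # maps student latents onto teacher latents

def stage1_step(clip_tokens, opt):
    """Stage 1: train the teacher to reconstruct masked pixels."""
    masked = random_mask(N_TOK)
    recon = decoder(teacher(clip_tokens, masked))
    loss = F.mse_loss(recon[:, masked], clip_tokens[:, masked])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def stage2_step(clip_tokens, opt):
    """Stage 2: the frozen teacher provides latent targets at masked positions."""
    masked = random_mask(N_TOK)
    with torch.no_grad():            # static teacher: no gradients, no EMA updates
        target = teacher(clip_tokens)[:, masked]
    pred = predictor(student(clip_tokens, masked))[:, masked]
    loss = F.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy run on random "video" tokens, just to show the two decoupled stages.
clips = torch.randn(4, N_TOK, D_PIX)
opt1 = torch.optim.AdamW([*teacher.parameters(), *decoder.parameters()], lr=1e-4)
for _ in range(3):
    stage1_step(clips, opt1)
teacher.requires_grad_(False)        # freeze the teacher for good
teacher.eval()
opt2 = torch.optim.AdamW([*student.parameters(), *predictor.parameters()], lr=1e-4)
for _ in range(3):
    stage2_step(clips, opt2)
```

The point of the sketch is the decoupling the abstract emphasizes: stage 1 optimizes only the teacher against pixels, and stage 2 optimizes only the student against the frozen teacher's latents, with no EMA schedule or other regularizer coupling the two models.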