V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Abstract

We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks: 7.71 mAP on Ego4D for short-term object-interaction anticipation, 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, and a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also performs strongly in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.
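
The abstract does not give the exact form of the dense predictive loss or of the deep self-supervision. As a rough illustration of the two ideas only, the sketch below assumes an L1 regression between predicted and target token features, a hypothetical `visible_weight` that scales the visible-token term, and a plain average over intermediate layers. All function names, the choice of L1, and the weighting are assumptions made for this example, not details taken from the paper.

```python
from statistics import fmean


def token_l1(pred, target):
    """Mean absolute error between two feature vectors of equal length."""
    return fmean(abs(p - t) for p, t in zip(pred, target))


def dense_predictive_loss(preds, targets, masked, visible_weight=0.5):
    """Loss in which masked and visible tokens both contribute.

    preds, targets: lists of per-token feature vectors (lists of floats)
    masked: list of bools, True where the token was hidden from the encoder
    visible_weight: hypothetical weight on the visible-token term (assumption)
    """
    errors = [token_l1(p, t) for p, t in zip(preds, targets)]
    masked_err = [e for e, m in zip(errors, masked) if m]
    visible_err = [e for e, m in zip(errors, masked) if not m]
    # Guard against an all-masked or all-visible input.
    masked_term = fmean(masked_err) if masked_err else 0.0
    visible_term = fmean(visible_err) if visible_err else 0.0
    return masked_term + visible_weight * visible_term


def deep_supervision_loss(preds_by_layer, targets_by_layer, masked,
                          visible_weight=0.5):
    """Average the dense loss over several intermediate encoder layers."""
    return fmean(
        dense_predictive_loss(p, t, masked, visible_weight)
        for p, t in zip(preds_by_layer, targets_by_layer)
    )
```

Setting `visible_weight=0.0` in this sketch recovers a loss computed on masked tokens only, which makes the contrast with the dense variant described in the abstract easy to see.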
