V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Abstract

We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling of both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks: 7.71 mAP on Ego4D for short-term object-interaction anticipation, 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, and a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also performs strongly in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.
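The first two components can be illustrated with a toy computation. The following is a minimal sketch rather than the authors' implementation: the abstract does not state the distance function, how visible and masked tokens are weighted, or how per-layer losses are combined, so the L1 distance, the `visible_weight` parameter, and the plain average over layers are all assumptions made for illustration, as are the function names.

```python
from typing import Sequence

Token = Sequence[float]  # one feature vector per token


def l1_distance(pred: Token, target: Token) -> float:
    """Mean absolute difference between two feature vectors (assumed distance)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def dense_loss(
    preds: Sequence[Token],
    targets: Sequence[Token],
    masked: Sequence[bool],
    visible_weight: float = 1.0,
) -> float:
    """Weighted average of per-token losses over ALL tokens.

    Masked tokens get weight 1.0; visible tokens get `visible_weight`
    (a hypothetical knob). Setting visible_weight=0.0 falls back to a
    masked-only objective, which makes the contrast easy to see.
    """
    total, norm = 0.0, 0.0
    for pred, target, is_masked in zip(preds, targets, masked):
        weight = 1.0 if is_masked else visible_weight
        total += weight * l1_distance(pred, target)
        norm += weight
    return total / norm if norm > 0 else 0.0


def deep_supervised_loss(
    layer_preds: Sequence[Sequence[Token]],
    layer_targets: Sequence[Sequence[Token]],
    masked: Sequence[bool],
    visible_weight: float = 1.0,
) -> float:
    """Average the dense loss over several intermediate layers (assumed aggregation)."""
    losses = [
        dense_loss(p, t, masked, visible_weight)
        for p, t in zip(layer_preds, layer_targets)
    ]
    return sum(losses) / len(losses)


# Toy data: three tokens with 2-D features; tokens 0 and 2 are masked.
preds = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
targets = [[1.0, 3.0], [0.0, 0.0], [2.0, 0.0]]
masked = [True, False, True]

all_tokens = dense_loss(preds, targets, masked)                      # visible token included
masked_only = dense_loss(preds, targets, masked, visible_weight=0.0)
two_layers = deep_supervised_loss([preds, preds], [targets, targets], masked)
```

In this toy case the visible token is predicted perfectly, so including it lowers the average loss relative to the masked-only variant. In a real model the relevant point is that every token position receives a training signal, not only the masked ones.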
