V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Abstract

A major challenge for modern AI is to learn to understand the world and to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories) to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100), surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
