V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
June 11, 2025
Authors: Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas
cs.AI
Abstract
A major challenge for modern AI is to learn to understand the world and to
act largely by observation. This paper explores a self-supervised approach
that combines internet-scale video data with a small amount of interaction data
(robot trajectories), to develop models capable of understanding, predicting,
and planning in the physical world. We first pre-train an action-free
joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset
comprising over 1 million hours of internet video. V-JEPA 2 achieves strong
performance on motion understanding (77.3 top-1 accuracy on Something-Something
v2) and state-of-the-art performance on human action anticipation (39.7
recall-at-5 on Epic-Kitchens-100), surpassing previous task-specific models.
Additionally, after aligning V-JEPA 2 with a large language model, we
demonstrate state-of-the-art performance on multiple video question-answering
tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on
TempCompass). Finally, we show how self-supervised learning can be applied to
robotic planning tasks by post-training a latent action-conditioned world
model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the
Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different
labs and enable picking and placing of objects using planning with image goals.
Notably, this is achieved without collecting any data from the robots in these
environments, and without any task-specific training or reward. This work
demonstrates how self-supervised learning from web-scale data and a small
amount of robot interaction data can yield a world model capable of planning in
the physical world.
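
The abstract states that V-JEPA 2-AC enables pick-and-place via planning with image goals, i.e., searching for actions whose predicted latent rollout ends near the encoded goal image. The sketch below is a minimal illustration of that general idea, not the paper's released implementation: the function names (`encode`, `predict`, `plan`), latent and action dimensions, and the cross-entropy-method optimizer are all illustrative assumptions, and the encoder/predictor are random stand-ins for the learned V-JEPA 2-AC modules.

```python
# Hypothetical sketch of planning with image goals using a latent
# action-conditioned world model. All names, shapes, and the CEM
# optimizer are assumptions for illustration only.
import numpy as np

np.random.seed(0)

LATENT_DIM = 32      # dimensionality of the latent state (assumed)
ACTION_DIM = 7       # e.g., end-effector deltas + gripper (assumed)
HORIZON = 5          # planning horizon in model steps
NUM_SAMPLES = 256    # candidate action sequences per CEM iteration
NUM_ELITES = 32      # top sequences kept to refit the sampling distribution
CEM_ITERS = 4


def encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for the video encoder: maps an image to a latent state."""
    rng = np.random.default_rng(abs(hash(image.tobytes())) % (2**32))
    return rng.standard_normal(LATENT_DIM)


def predict(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Stand-in for the action-conditioned predictor: next latent state."""
    return np.tanh(state + 0.1 * np.resize(action, LATENT_DIM))


def plan(current_image: np.ndarray, goal_image: np.ndarray) -> np.ndarray:
    """Cross-entropy-method search for an action sequence whose predicted
    rollout ends close (in latent space) to the encoded goal image."""
    z0, z_goal = encode(current_image), encode(goal_image)
    mean = np.zeros((HORIZON, ACTION_DIM))
    std = np.ones((HORIZON, ACTION_DIM))

    for _ in range(CEM_ITERS):
        # Sample candidate action sequences from the current distribution.
        actions = mean + std * np.random.standard_normal(
            (NUM_SAMPLES, HORIZON, ACTION_DIM))

        # Roll each candidate forward in latent space and score it by the
        # distance between the final predicted state and the goal embedding.
        costs = np.empty(NUM_SAMPLES)
        for i in range(NUM_SAMPLES):
            z = z0
            for t in range(HORIZON):
                z = predict(z, actions[i, t])
            costs[i] = np.linalg.norm(z - z_goal)

        # Refit the sampling distribution to the lowest-cost (elite) sequences.
        elites = actions[np.argsort(costs)[:NUM_ELITES]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6

    return mean[0]  # execute the first action, then replan (receding horizon)


if __name__ == "__main__":
    current = np.zeros((64, 64, 3), dtype=np.uint8)
    goal = np.full((64, 64, 3), 255, dtype=np.uint8)
    print("first planned action:", plan(current, goal))
```

In this receding-horizon setup only the first planned action would be executed before re-encoding the new observation and replanning, which matches the zero-shot, reward-free deployment described in the abstract at a high level.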