V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
June 11, 2025
Authors: Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas
cs.AI
Abstract
A major challenge for modern AI is to learn to understand the world and to
act largely by observation. This paper explores a self-supervised approach
that combines internet-scale video data with a small amount of interaction data
(robot trajectories), to develop models capable of understanding, predicting,
and planning in the physical world. We first pre-train an action-free
joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset
comprising over 1 million hours of internet video. V-JEPA 2 achieves strong
performance on motion understanding (77.3 top-1 accuracy on Something-Something
v2) and state-of-the-art performance on human action anticipation (39.7
recall-at-5 on Epic-Kitchens-100), surpassing previous task-specific models.
Additionally, after aligning V-JEPA 2 with a large language model, we
demonstrate state-of-the-art performance on multiple video question-answering
tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on
TempCompass). Finally, we show how self-supervised learning can be applied to
robotic planning tasks by post-training a latent action-conditioned world
model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the
Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different
labs and enable picking and placing of objects using planning with image goals.
Notably, this is achieved without collecting any data from the robots in these
environments, and without any task-specific training or reward. This work
demonstrates how self-supervised learning from web-scale data and a small
amount of robot interaction data can yield a world model capable of planning in
the physical world.
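
The abstract states that V-JEPA 2-AC enables pick-and-place via planning with image goals, i.e., searching for actions whose predicted latent rollout ends near the encoded goal image. The sketch below is a minimal illustration of that general idea, not the paper's released implementation: the function names (`encode`, `predict`, `plan`), latent and action dimensions, and the cross-entropy-method optimizer are all illustrative assumptions, and the encoder/predictor are random stand-ins for the learned V-JEPA 2-AC modules.

```python
# Hypothetical sketch of planning with image goals using a latent
# action-conditioned world model. All names, shapes, and the CEM
# optimizer are assumptions for illustration only.
import numpy as np

np.random.seed(0)

LATENT_DIM = 32      # dimensionality of the latent state (assumed)
ACTION_DIM = 7       # e.g., end-effector deltas + gripper (assumed)
HORIZON = 5          # planning horizon in model steps
NUM_SAMPLES = 256    # candidate action sequences per CEM iteration
NUM_ELITES = 32      # top sequences kept to refit the sampling distribution
CEM_ITERS = 4


def encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for the video encoder: maps an image to a latent state."""
    rng = np.random.default_rng(abs(hash(image.tobytes())) % (2**32))
    return rng.standard_normal(LATENT_DIM)


def predict(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Stand-in for the action-conditioned predictor: next latent state."""
    return np.tanh(state + 0.1 * np.resize(action, LATENT_DIM))


def plan(current_image: np.ndarray, goal_image: np.ndarray) -> np.ndarray:
    """Cross-entropy-method search for an action sequence whose predicted
    rollout ends close (in latent space) to the encoded goal image."""
    z0, z_goal = encode(current_image), encode(goal_image)
    mean = np.zeros((HORIZON, ACTION_DIM))
    std = np.ones((HORIZON, ACTION_DIM))

    for _ in range(CEM_ITERS):
        # Sample candidate action sequences from the current distribution.
        actions = mean + std * np.random.standard_normal(
            (NUM_SAMPLES, HORIZON, ACTION_DIM))

        # Roll each candidate forward in latent space and score it by the
        # distance between the final predicted state and the goal embedding.
        costs = np.empty(NUM_SAMPLES)
        for i in range(NUM_SAMPLES):
            z = z0
            for t in range(HORIZON):
                z = predict(z, actions[i, t])
            costs[i] = np.linalg.norm(z - z_goal)

        # Refit the sampling distribution to the lowest-cost (elite) sequences.
        elites = actions[np.argsort(costs)[:NUM_ELITES]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6

    return mean[0]  # execute the first action, then replan (receding horizon)


if __name__ == "__main__":
    current = np.zeros((64, 64, 3), dtype=np.uint8)
    goal = np.full((64, 64, 3), 255, dtype=np.uint8)
    print("first planned action:", plan(current, goal))
```

In this receding-horizon setup only the first planned action would be executed before re-encoding the new observation and replanning, which matches the zero-shot, reward-free deployment described in the abstract at a high level.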