Whole-Body Conditioned Egocentric Video Prediction
June 26, 2025
Authors: Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik
cs.AI
Abstract
We train models to Predict Ego-centric Video from human Actions (PEVA), given
the past video and an action represented by the relative 3D body pose. By
conditioning on kinematic pose trajectories, structured by the joint hierarchy
of the body, our model learns to simulate how physical human actions shape the
environment from a first-person point of view. We train an auto-regressive
conditional diffusion transformer on Nymeria, a large-scale dataset of
real-world egocentric video and body pose capture. We further design a
hierarchical evaluation protocol with increasingly challenging tasks, enabling
a comprehensive analysis of the model's embodied prediction and control
abilities. Our work represents an initial attempt to tackle the challenges of
modeling complex real-world environments and embodied agent behaviors with
video prediction from the perspective of a human.
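
To make the conditioning interface concrete, below is a minimal sketch of one autoregressive denoising step of a conditional diffusion transformer, in the spirit of the abstract: the model predicts the noise on the next frame's latent given past frame latents and an action expressed as a relative 3D body pose. Everything here is an assumption for illustration, not the paper's implementation: the class name `PEVASketch`, the joint count (22), the latent dimensionality, and the simple concatenation-based conditioning are all hypothetical choices.

```python
import torch
import torch.nn as nn


class PEVASketch(nn.Module):
    """Hypothetical sketch (not the paper's architecture) of one step of an
    autoregressive conditional diffusion transformer: denoise the next
    frame's latent, conditioned on past frame latents and a relative
    3D body-pose action."""

    def __init__(self, latent_dim=256, num_joints=22, n_heads=8, n_layers=6):
        super().__init__()
        # Action = relative 3D body pose: an (x, y, z) offset per joint,
        # flattened into one vector. Joint count is an assumed value.
        self.action_proj = nn.Linear(num_joints * 3, latent_dim)
        # Scalar diffusion timestep embedded into the latent space.
        self.time_proj = nn.Linear(1, latent_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, past_latents, noisy_next, action, t):
        # past_latents: (B, T, D) latents of past video frames (context)
        # noisy_next:   (B, 1, D) noised latent of the frame to predict
        # action:       (B, J*3) relative joint displacements
        # t:            (B, 1)   diffusion timestep in [0, 1]
        cond = (self.action_proj(action) + self.time_proj(t)).unsqueeze(1)
        # Concatenate context, conditioning token, and the noisy target
        # into one sequence for the transformer backbone.
        seq = torch.cat([past_latents, cond, noisy_next], dim=1)
        h = self.backbone(seq)
        # Predicted noise for the next-frame latent (last sequence slot).
        return self.out(h[:, -1:])


# Usage with random tensors, just to show the expected shapes:
model = PEVASketch()
B, T, D, J = 2, 8, 256, 22
eps_hat = model(
    torch.randn(B, T, D),   # past frame latents
    torch.randn(B, 1, D),   # noisy next-frame latent
    torch.randn(B, J * 3),  # relative 3D pose action
    torch.rand(B, 1),       # diffusion timestep
)
print(eps_hat.shape)  # torch.Size([2, 1, 256])
```

At inference, such a model would be rolled out autoregressively: denoise the next frame's latent over several diffusion steps, append it to the context, and repeat for each subsequent pose-conditioned action.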