Whole-Body Conditioned Egocentric Video Prediction

June 26, 2025
Authors: Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik
cs.AI

Abstract

We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
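The abstract does not disclose implementation details. As a rough illustration only, the sketch below shows one way a single denoising step of an autoregressive conditional diffusion transformer could be conditioned on past-frame latents and a relative 3D body-pose action, in the spirit of PEVA. Every name, dimension, and interface here (PEVASketch, latent_dim, num_joints, ctx_frames, the pre-tokenized frame latents) is a hypothetical assumption, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PEVASketch(nn.Module):
    """Hypothetical single denoising step of an autoregressive
    conditional diffusion transformer. Illustrative only: PEVA's
    real layer sizes, tokenization, and conditioning scheme are
    not specified in the abstract."""

    def __init__(self, latent_dim=256, num_joints=22, ctx_frames=4):
        super().__init__()
        # Action encoder: relative 3D pose deltas, (x, y, z) per joint.
        self.action_proj = nn.Linear(num_joints * 3, latent_dim)
        # Context encoder: past frames, assumed pre-tokenized into latents.
        self.ctx_proj = nn.Linear(ctx_frames * latent_dim, latent_dim)
        # Embedding of the diffusion noise-level timestep.
        self.t_embed = nn.Embedding(1000, latent_dim)
        # A small transformer stands in for the diffusion backbone.
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_next, past_latents, rel_pose, t):
        # noisy_next:   (B, T_tok, D) noised latent tokens of the next frame
        # past_latents: (B, ctx_frames, D) latents of already-generated frames
        # rel_pose:     (B, num_joints * 3) relative body-pose action
        # t:            (B,) integer diffusion timesteps
        cond = (self.action_proj(rel_pose)
                + self.ctx_proj(past_latents.flatten(1))
                + self.t_embed(t))
        # Prepend the condition as one extra token the backbone attends to.
        x = torch.cat([cond.unsqueeze(1), noisy_next], dim=1)
        x = self.backbone(x)
        return self.out(x[:, 1:])  # predicted noise for next-frame tokens

model = PEVASketch()
eps = model(torch.randn(2, 16, 256),          # noised next-frame tokens
            torch.randn(2, 4, 256),           # past-frame latents
            torch.randn(2, 22 * 3),           # relative pose action
            torch.randint(0, 1000, (2,)))     # diffusion timesteps
```

At inference, such a model would be rolled out autoregressively: the denoised next-frame latent is appended to the context window and the process repeats, which is what lets action conditioning steer the predicted first-person video over time.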