Whole-Body Conditioned Egocentric Video Prediction

Abstract

We train models to Predict Ego-centric Video from human Actions (PEVA), given past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an autoregressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
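To make the autoregressive, action-conditioned setup more concrete, the following is a minimal sketch of a rollout loop, not the authors' implementation. The names `rollout`, `toy_predictor`, `Frame`, `Action`, and `context_len` are hypothetical, frames and actions are reduced to flat lists of floats, and the learned diffusion transformer is replaced by a trivial placeholder function. At least one past frame is assumed.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]   # stand-in for an image or latent, flattened (assumption)
Action = List[float]  # stand-in for a relative 3D body-pose vector (assumption)


def rollout(
    predict_next: Callable[[Sequence[Frame], Action], Frame],
    past_frames: Sequence[Frame],
    actions: Sequence[Action],
    context_len: int = 4,
) -> List[Frame]:
    """Autoregressively predict one frame per action.

    Each predicted frame is appended to the context window, so later
    predictions are conditioned on earlier predictions, not ground truth.
    """
    context: Deque[Frame] = deque(past_frames[-context_len:], maxlen=context_len)
    predicted: List[Frame] = []
    for action in actions:
        frame = predict_next(list(context), action)
        predicted.append(frame)
        context.append(frame)  # the oldest frame drops out once the window is full
    return predicted


def toy_predictor(context: Sequence[Frame], action: Action) -> Frame:
    """Hypothetical placeholder for the learned model.

    Shifts every value of the most recent frame by the mean of the action.
    """
    shift = sum(action) / len(action)
    return [value + shift for value in context[-1]]


if __name__ == "__main__":
    frames = rollout(toy_predictor, [[0.0, 0.0]], [[1.0, 1.0], [2.0, 2.0]], context_len=2)
    print(frames)
```

Because each predicted frame is fed back into the context, errors can accumulate over long rollouts, which is one reason an evaluation protocol with increasingly long or difficult tasks is informative for this kind of model.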

