EgoTwin: Dreaming Body and View in First Person

Abstract

While exocentric video synthesis has made great progress, egocentric video generation remains largely underexplored. It requires modeling first-person view content together with the camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint, and it incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
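To make the idea of anchoring motion to the head joint concrete, the following is a minimal sketch and not the EgoTwin implementation. It assumes a motion clip is given as per-frame lists of 3D joint positions, considers translation only (head orientation is ignored), and uses hypothetical names (`to_head_centric`, `mean_translation_error`, `head_index`) that do not come from the paper. The error function is a simple illustrative stand-in for a viewpoint-alignment check, not one of the paper's metrics.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def to_head_centric(
    frames: Sequence[Sequence[Vec3]], head_index: int
) -> Tuple[List[Vec3], List[List[Vec3]]]:
    """Split a clip into a head trajectory and head-relative joint offsets.

    Assumption: translation-only anchoring; head rotation is not modeled.
    """
    head_traj: List[Vec3] = []
    offsets: List[List[Vec3]] = []
    for joints in frames:
        hx, hy, hz = joints[head_index]
        head_traj.append((hx, hy, hz))
        # Express every joint relative to the head position in this frame.
        offsets.append([(x - hx, y - hy, z - hz) for (x, y, z) in joints])
    return head_traj, offsets


def mean_translation_error(
    camera_traj: Sequence[Vec3], head_traj: Sequence[Vec3]
) -> float:
    """Mean per-frame Euclidean distance between camera and head positions."""
    if len(camera_traj) != len(head_traj) or len(head_traj) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(c, h) for c, h in zip(camera_traj, head_traj))
    return total / len(head_traj)


if __name__ == "__main__":
    # Two frames, two joints; joint 1 plays the role of the head.
    clip = [[(0, 0, 0), (1, 2, 3)], [(1, 0, 0), (2, 2, 3)]]
    head, rel = to_head_centric(clip, head_index=1)
    print(head)
    print(rel)
    print(mean_translation_error(head, head))
```

With this layout the head trajectory is stored explicitly, so it can be compared directly with a camera trajectory estimated from the generated video, while the remaining joints are described relative to the head.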
