EgoTwin: Dreaming Body and View in First Person
August 18, 2025
Authors: Jingqiao Xiu, Fangzhou Hong, Yicong Li, Mengze Li, Wentao Wang, Sirui Han, Liang Pan, Ziwei Liu
cs.AI
Abstract
While exocentric video synthesis has achieved great progress, egocentric
video generation, which requires modeling first-person view content along with
the camera motion patterns induced by the wearer's body movements, remains
largely underexplored. To bridge this gap, we introduce a novel task of joint
egocentric video and human motion generation, characterized by two key
challenges: 1) Viewpoint Alignment: the camera trajectory in the generated
video must accurately align with the head trajectory derived from human motion;
2) Causal Interplay: the synthesized human motion must causally align with the
observed visual dynamics across adjacent video frames. To address these
challenges, we propose EgoTwin, a joint video-motion generation framework built
on the diffusion transformer architecture. Specifically, EgoTwin introduces a
head-centric motion representation that anchors the human motion to the head
joint and incorporates a cybernetics-inspired interaction mechanism that
explicitly captures the causal interplay between video and motion within
attention operations. For comprehensive evaluation, we curate a large-scale
real-world dataset of synchronized text-video-motion triplets and design novel
metrics to assess video-motion consistency. Extensive experiments demonstrate
the effectiveness of the EgoTwin framework.
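
The abstract states that EgoTwin anchors human motion to the head joint so that the camera (head) trajectory is represented explicitly. The paper's exact parameterization is not given here, so the following is only a minimal sketch of what such a head-centric motion representation could look like, assuming motion is given as global per-frame joint positions; the joint index and the function names `to_head_centric` / `to_global` are illustrative, not from the paper.

```python
# Hypothetical head-centric motion representation: global joint positions are
# re-expressed relative to the head joint of each frame, so the head (camera)
# trajectory is carried explicitly and all other joints are anchored to it.
import numpy as np

HEAD = 15  # assumed index of the head joint in the skeleton


def to_head_centric(joints: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split global motion into a head trajectory and head-relative joints.

    joints: (T, J, 3) global joint positions over T frames.
    Returns (head_traj, local_joints), where head_traj is (T, 3) and
    local_joints is (T, J, 3) with the head at the origin of every frame.
    """
    head_traj = joints[:, HEAD, :]                 # (T, 3) head / camera path
    local_joints = joints - head_traj[:, None, :]  # anchor all joints to the head
    return head_traj, local_joints


def to_global(head_traj: np.ndarray, local_joints: np.ndarray) -> np.ndarray:
    """Invert the representation back to global joint positions."""
    return local_joints + head_traj[:, None, :]


if __name__ == "__main__":
    motion = np.random.randn(120, 22, 3)           # 120 frames, 22 joints
    head_traj, local = to_head_centric(motion)
    assert np.allclose(to_global(head_traj, local), motion)
```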
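The abstract also says the cybernetics-inspired mechanism captures the causal interplay between video and motion within attention operations. The actual design is not described here; the sketch below is only one plausible illustration, assuming per-frame video and motion tokens and a block attention mask in which cross-modal attention is limited to the current and earlier frames. All names and the masking rule are assumptions, not EgoTwin's method.

```python
# Illustrative sketch (not the paper's mechanism) of enforcing causal interplay
# between per-frame video tokens and motion tokens inside attention: each
# modality may only attend cross-modally to tokens from the same or earlier
# frames, so it conditions only on already-observed dynamics.
import torch


def causal_interplay_mask(num_frames: int, vid_per_frame: int, mot_per_frame: int) -> torch.Tensor:
    """Build a boolean attention mask (True = may attend) over [video; motion] tokens."""
    n_vid = num_frames * vid_per_frame
    n_mot = num_frames * mot_per_frame
    frame_of = torch.cat([
        torch.arange(num_frames).repeat_interleave(vid_per_frame),  # frame of each video token
        torch.arange(num_frames).repeat_interleave(mot_per_frame),  # frame of each motion token
    ])
    is_video = torch.cat([torch.ones(n_vid, dtype=torch.bool),
                          torch.zeros(n_mot, dtype=torch.bool)])
    total = n_vid + n_mot
    mask = torch.ones(total, total, dtype=torch.bool)               # start fully connected
    cross = is_video[:, None] ^ is_video[None, :]                   # cross-modal query/key pairs
    # Keep a cross-modal link only if the key's frame is not later than the query's frame.
    mask &= ~cross | (frame_of[None, :] <= frame_of[:, None])
    return mask


if __name__ == "__main__":
    m = causal_interplay_mask(num_frames=4, vid_per_frame=2, mot_per_frame=1)
    print(m.shape)  # torch.Size([12, 12])
```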