
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

February 3, 2026
作者: Zhixue Fang, Xu He, Songlin Tang, Haoxian Zhang, Qingfeng Li, Xiaoqiang Liu, Pengfei Wan, Kun Gai
cs.AI

Abstract

Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and imprecise dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
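The pipeline the abstract describes — a motion encoder that distills driving frames into a few compact motion tokens, cross-attention injection of those tokens into the generator, and an auxiliary SMPL loss weight annealed to zero — can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: all module names, dimensions, token counts, and the linear annealing schedule are assumptions.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Hypothetical sketch: pool per-frame features of the driving video
    into a small set of compact, view-agnostic motion tokens."""
    def __init__(self, frame_dim=512, token_dim=256, num_tokens=8):
        super().__init__()
        self.proj = nn.Linear(frame_dim, token_dim)
        # Learnable queries act as the motion-token "slots".
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.pool = nn.MultiheadAttention(token_dim, num_heads=4, batch_first=True)

    def forward(self, frame_feats):              # (B, T, frame_dim)
        kv = self.proj(frame_feats)              # (B, T, token_dim)
        q = self.queries.expand(frame_feats.size(0), -1, -1)
        tokens, _ = self.pool(q, kv, kv)         # (B, num_tokens, token_dim)
        return tokens

class CrossAttnInjection(nn.Module):
    """A generator block attends from video latents to the motion tokens,
    injecting motion semantically rather than as a pixel-aligned constraint."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)

    def forward(self, latents, motion_tokens):   # (B, N, D), (B, M, D)
        out, _ = self.attn(latents, motion_tokens, motion_tokens)
        return latents + out                     # residual injection

def smpl_weight(step, anneal_steps=10_000):
    """Auxiliary geometric supervision: SMPL guides early training only,
    with its loss weight annealed linearly to zero (schedule is assumed)."""
    return max(0.0, 1.0 - step / anneal_steps)
```

A forward pass would distill, say, 16 driving frames into 8 tokens and inject them into the generator's latent sequence, while the total loss would add `smpl_weight(step)` times the geometric term, so external 3D guidance fades out as training proceeds.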