3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
February 3, 2026
Authors: Zhixue Fang, Xu He, Songlin Tang, Haoxian Zhang, Qingfeng Li, Xiaoqiang Liu, Pengfei Wan, Kun Gai
cs.AI
Abstract
Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and erroneous dynamics); when imposed as strong constraints, these models override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to genuine 3D spatial motion understanding learned from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
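The core mechanism in the abstract is compact: an encoder compresses driving frames into a small set of view-agnostic motion tokens, which the generator consumes through cross-attention. The PyTorch sketch below illustrates only that token-injection pattern; all module names, token counts, and dimensions (e.g., `MotionEncoder`, `num_tokens=16`, `dim=768`) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Distill a driving clip into a fixed set of motion tokens (hypothetical design)."""
    def __init__(self, in_dim=3 * 224 * 224, num_tokens=16, dim=768):
        super().__init__()
        self.frame_proj = nn.Linear(in_dim, dim)                   # per-frame embedding
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))  # learned token queries
        self.pool = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frames):                      # frames: (B, T, C, H, W)
        feats = self.frame_proj(frames.flatten(2))  # (B, T, dim)
        q = self.queries.expand(frames.size(0), -1, -1)
        tokens, _ = self.pool(q, feats, feats)      # pool over all driving frames
        return tokens                               # (B, num_tokens, dim)

class MotionCrossAttention(nn.Module):
    """Inject motion tokens into a generator block via residual cross-attention."""
    def __init__(self, dim=768):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, hidden, motion_tokens):       # hidden: (B, N, dim) generator activations
        ctx, _ = self.attn(self.norm(hidden), motion_tokens, motion_tokens)
        return hidden + ctx                         # motion enters as context, not as pixel-aligned pose
```

Because the tokens carry no per-pixel spatial binding, the generator remains free to render the motion from a different camera, which is what distinguishes this injection route from concatenating a 2D pose map with the input frames.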
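The SMPL annealing described in the abstract can likewise be summarized in a few lines: an auxiliary geometric term starts at full weight and decays to zero, so the model gradually stops relying on external 3D reconstruction. This is a minimal sketch assuming a linear schedule; the paper does not specify the schedule shape, step count, or loss terms, so `anneal_steps` and both loss arguments are placeholders.

```python
def smpl_weight(step: int, anneal_steps: int = 10_000, init_weight: float = 1.0) -> float:
    """Linearly decay the SMPL supervision weight to zero (assumed schedule)."""
    return init_weight * max(0.0, 1.0 - step / anneal_steps)

def total_loss(diffusion_loss, smpl_geom_loss, step):
    # diffusion_loss: main video-generation objective
    # smpl_geom_loss: auxiliary agreement with SMPL geometry, used only early in training
    return diffusion_loss + smpl_weight(step) * smpl_geom_loss
```

Once `smpl_weight` reaches zero, gradients flow only through the generative objective, matching the abstract's transition from external 3D guidance to spatial understanding learned from data and the generator's priors.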