SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis
November 24, 2025
Authors: Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An, Yebin Liu, Qingyao Wu
cs.AI
Abstract
Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual priors, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle: during the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide the next step of joint generation. Experiments show that our method outperforms state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.
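The reprojection step of the closed loop can be illustrated with a minimal sketch. The abstract does not specify the camera model or data layout, so the code below assumes calibrated pinhole cameras and world-space metric point tracks; `Camera`, `project_point`, and `reproject_tracks` are hypothetical names introduced for illustration and are not part of SyncMV4D.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float, float]  # (u, v, depth in camera frame)


@dataclass
class Camera:
    """Assumed pinhole camera: world-to-camera rotation R, translation t, intrinsics."""
    R: Sequence[Sequence[float]]  # 3x3 rotation, row-major
    t: Sequence[float]            # 3-vector
    fx: float
    fy: float
    cx: float
    cy: float


def project_point(p: Point3, cam: Camera) -> Optional[Pixel]:
    """Project one world-space point; return None if it lies behind the camera."""
    # World -> camera frame: X_cam = R @ X_world + t
    xc, yc, zc = (
        sum(cam.R[r][i] * p[i] for i in range(3)) + cam.t[r] for r in range(3)
    )
    if zc <= 0.0:
        return None
    # Perspective division followed by the intrinsic mapping to pixels.
    return (cam.fx * xc / zc + cam.cx, cam.fy * yc / zc + cam.cy, zc)


def reproject_tracks(
    tracks: List[List[Point3]], cameras: List[Camera]
) -> List[List[List[Optional[Pixel]]]]:
    """Reproject tracks[frame][point] into every view.

    Returns out[view][frame][point] = (u, v, depth) or None.
    """
    return [
        [[project_point(p, cam) for p in frame] for frame in tracks]
        for cam in cameras
    ]


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    cams = [
        Camera(identity, [0.0, 0.0, 0.0], 100.0, 100.0, 50.0, 50.0),
        Camera(identity, [-1.0, 0.0, 0.0], 100.0, 100.0, 50.0, 50.0),
    ]
    # Two frames, one tracked point moving along x at a depth of 2 m.
    tracks = [[(0.0, 0.0, 2.0)], [(1.0, 0.0, 2.0)]]
    for view, per_view in enumerate(reproject_tracks(tracks, cams)):
        print(view, per_view)
```

Because all views share the same world-space tracks, per-view 2D guidance derived this way is geometrically consistent by construction, which is one plausible reading of why globally aligned 4D tracks help multi-view consistency.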