SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios
June 3, 2025
Authors: Lingwei Dang, Ruizhi Shao, Hongwen Zhang, Wei Min, Yebin Liu, Qingyao Wu
cs.AI
Abstract
Hand-Object Interaction (HOI) generation has significant application
potential. However, current 3D HOI motion generation approaches heavily rely on
predefined 3D object models and lab-captured motion data, limiting
generalization capabilities. Meanwhile, HOI video generation methods prioritize
pixel-level visual fidelity, often sacrificing physical plausibility.
Recognizing that visual appearance and motion patterns share fundamental
physical laws in the real world, we propose a novel framework that combines
visual priors and dynamic constraints within a synchronized diffusion process
to generate HOI video and motion simultaneously. To integrate the
heterogeneous semantic, appearance, and motion features, our method implements
tri-modal adaptive modulation for feature alignment, coupled with 3D
full-attention for modeling inter- and intra-modal dependencies. Furthermore,
we introduce a vision-aware 3D interaction diffusion model that generates
explicit 3D interaction sequences directly from the synchronized diffusion
outputs and then feeds them back, forming a closed feedback loop. This
architecture eliminates dependencies on predefined object models or explicit
pose guidance while significantly enhancing video-motion consistency.
Experimental results demonstrate our method's superiority over state-of-the-art
approaches in generating high-fidelity, dynamically plausible HOI sequences,
with notable generalization capabilities in unseen real-world scenarios.
Project page at https://github.com/Droliven/SViMo_project.
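
As a rough illustration of the two mechanisms named in the abstract, the following pure-Python sketch applies a separate scale and shift to each modality's tokens and then runs one attention pass over the concatenated sequence. It is not the authors' implementation. The function names, the AdaLN-style form of the modulation, the identity attention projections, and the fixed `mods` dictionary are all assumptions made for illustration; a real model would presumably predict the modulation parameters from conditioning signals and use learned projections.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def modulate(tokens, scale, shift):
    """AdaLN-style modulation (assumed form): normalize, then scale and shift."""
    return [[n * (1.0 + s) + b for n, s, b in zip(layer_norm(t), scale, shift)]
            for t in tokens]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def full_attention(tokens):
    """Single-head self-attention with identity projections over all tokens."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def joint_block(text, video, motion, mods):
    """Modulate each modality separately, then attend over the joint sequence."""
    seq = []
    for name, toks in (("text", text), ("video", video), ("motion", motion)):
        scale, shift = mods[name]  # hypothetical per-modality parameters
        seq.extend(modulate(toks, scale, shift))
    mixed = full_attention(seq)  # every token attends to every modality
    n_t, n_v = len(text), len(video)
    return mixed[:n_t], mixed[n_t:n_t + n_v], mixed[n_t + n_v:]
```

Because all tokens share one attention pass, each output token mixes information both within its own modality and across the other two, which is the sense in which a single full-attention layer can model intra- and inter-modal dependencies at once.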