SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios
June 3, 2025
Authors: Lingwei Dang, Ruizhi Shao, Hongwen Zhang, Wei Min, Yebin Liu, Qingyao Wu
cs.AI
Abstract
Hand-Object Interaction (HOI) generation has significant application
potential. However, current 3D HOI motion generation approaches heavily rely on
predefined 3D object models and lab-captured motion data, limiting
generalization capabilities. Meanwhile, HOI video generation methods prioritize
pixel-level visual fidelity, often sacrificing physical plausibility.
Recognizing that visual appearance and motion patterns share fundamental
physical laws in the real world, we propose a novel framework that combines
visual priors and dynamic constraints within a synchronized diffusion process
to generate HOI video and motion simultaneously. To integrate heterogeneous
semantic, appearance, and motion features, our method applies tri-modal
adaptive modulation for feature alignment, coupled with 3D full attention to
model inter- and intra-modal dependencies. Furthermore,
we introduce a vision-aware 3D interaction diffusion model that generates
explicit 3D interaction sequences directly from the synchronized diffusion
outputs, then feeds them back to form a closed feedback loop. This
architecture eliminates dependencies on predefined object models or explicit
pose guidance while significantly enhancing video-motion consistency.
Experimental results demonstrate our method's superiority over state-of-the-art
approaches in generating high-fidelity, dynamically plausible HOI sequences,
with notable generalization capabilities in unseen real-world scenarios.
Project page at https://github.com/Droliven/SViMo_project.
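
The abstract names two mechanisms, tri-modal adaptive modulation and full attention across modalities, without giving details. The pure-Python sketch below is one possible reading, not the authors' implementation. It assumes the modulation is a per-modality scale and shift applied after layer normalization, and that "full attention" means a single self-attention pass over the concatenated text, video, and motion tokens, with video tokens already flattened over time and space. All function names, token counts, and parameter values are hypothetical, and the attention uses identity projections for brevity.

```python
import math
import random


def layer_norm(token, eps=1e-6):
    """Normalize one token (list of floats) to zero mean and unit variance."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def modulate(tokens, scale, shift):
    """Adaptive modulation (assumed form): norm(x) * (1 + scale) + shift."""
    return [
        [n * (1.0 + s) + b for n, s, b in zip(layer_norm(t), scale, shift)]
        for t in tokens
    ]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def full_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)]
        )
    return out


def tri_modal_block(text, video, motion, params):
    """Modulate each modality separately, attend jointly, then split again.

    params maps a modality name to a hypothetical (scale, shift) pair; in a
    diffusion model these would typically be predicted from the timestep.
    """
    streams = {"text": text, "video": video, "motion": motion}
    joint = []
    for name in ("text", "video", "motion"):
        scale, shift = params[name]
        joint.extend(modulate(streams[name], scale, shift))
    # Every token attends to every other token, within and across modalities.
    mixed = full_attention(joint)
    n_text, n_video = len(text), len(video)
    return mixed[:n_text], mixed[n_text:n_text + n_video], mixed[n_text + n_video:]


random.seed(0)
DIM = 4


def rand_tokens(count):
    """Random stand-in features; real tokens would come from encoders."""
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


text, video, motion = rand_tokens(2), rand_tokens(6), rand_tokens(3)
params = {
    "text": ([0.1] * DIM, [0.0] * DIM),
    "video": ([0.0] * DIM, [0.2] * DIM),
    "motion": ([-0.1] * DIM, [0.1] * DIM),
}
text_out, video_out, motion_out = tri_modal_block(text, video, motion, params)
```

Modulating each modality with its own scale and shift before a shared attention pass lets the three streams keep different feature statistics while still exchanging information in one step. Whether SViMo arranges these operations in this order is not stated in the abstract.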