DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation
May 23, 2025
Authors: Junhao Chen, Mingjin Chen, Jianjin Xu, Xiang Li, Junting Dong, Mingze Sun, Puhua Jiang, Hongxiang Li, Yuhang Yang, Hao Zhao, Xiaoxiao Long, Ruqi Huang
cs.AI
Abstract
Controllable video generation (CVG) has advanced rapidly, yet current systems
falter when more than one actor must move, interact, and exchange positions
under noisy control signals. We address this gap with DanceTogether, the first
end-to-end diffusion framework that turns a single reference image plus
independent pose-mask streams into long, photorealistic videos while strictly
preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at
every denoising step by fusing robust tracking masks with semantically rich but
noisy pose heat-maps, eliminating the identity drift and appearance bleeding
that plague frame-wise pipelines. To train and evaluate at scale, we introduce
(i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii)
HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain
transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the
DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure
skating. On TogetherVideoBench, DanceTogether outperforms prior art by a
significant margin. Moreover, we show that a one-hour fine-tune yields
convincing human-robot videos, underscoring broad generalization to embodied-AI
and human-robot interaction (HRI) tasks. Extensive ablations confirm that persistent identity-action
binding is critical to these gains. Together, our model, datasets, and
benchmark lift CVG from single-subject choreography to compositionally
controllable, multi-actor interaction, opening new avenues for digital
production, simulation, and embodied intelligence. Our video demos and code are
available at https://DanceTog.github.io/.
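The abstract's central mechanism is the per-step binding of "who" (robust tracking masks) to "how" (noisy pose heat-maps). The sketch below illustrates one plausible form of that fusion in PyTorch; the class name MaskPoseAdapterSketch, the mask-gated concatenation, the channel widths, and the sum over actors are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of the identity-action fusion idea: gate each actor's
# noisy pose heat-maps with that actor's reliable tracking mask, encode,
# and merge into one conditioning map for the diffusion denoiser.
# All shapes and layer choices here are assumptions for illustration.
import torch
import torch.nn as nn

class MaskPoseAdapterSketch(nn.Module):
    def __init__(self, num_keypoints: int = 17, cond_channels: int = 64):
        super().__init__()
        # Shared encoder over the concatenated (mask + gated heat-map) channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(1 + num_keypoints, cond_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(cond_channels, cond_channels, 3, padding=1),
        )

    def forward(self, masks: torch.Tensor, heatmaps: torch.Tensor) -> torch.Tensor:
        # masks:    (B, N, 1, H, W) per-actor tracking masks ("who")
        # heatmaps: (B, N, K, H, W) per-actor pose heat-maps ("how")
        B, N, _, H, W = masks.shape
        # Gating keeps each actor's pose evidence inside that actor's
        # spatial region, so identities cannot bleed into one another.
        gated = heatmaps * masks                      # (B, N, K, H, W)
        x = torch.cat([masks, gated], dim=2)          # (B, N, 1+K, H, W)
        feats = self.encoder(x.flatten(0, 1))         # (B*N, C, H, W)
        # Merge actors into a single conditioning map, injected at
        # every denoising step of the video diffusion model.
        return feats.view(B, N, -1, H, W).sum(dim=1)  # (B, C, H, W)

adapter = MaskPoseAdapterSketch()
cond = adapter(torch.rand(2, 2, 1, 64, 64), torch.rand(2, 2, 17, 64, 64))
print(cond.shape)  # torch.Size([2, 64, 64, 64])
```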