DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation
May 23, 2025
Authors: Junhao Chen, Mingjin Chen, Jianjin Xu, Xiang Li, Junting Dong, Mingze Sun, Puhua Jiang, Hongxiang Li, Yuhang Yang, Hao Zhao, Xiaoxiao Long, Ruqi Huang
cs.AI
Abstract
Controllable video generation (CVG) has advanced rapidly, yet current systems
falter when more than one actor must move, interact, and exchange positions
under noisy control signals. We address this gap with DanceTogether, the first
end-to-end diffusion framework that turns a single reference image plus
independent pose-mask streams into long, photorealistic videos while strictly
preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at
every denoising step by fusing robust tracking masks with semantically rich
but noisy pose heat-maps, eliminating the identity drift and appearance bleeding
that plague frame-wise pipelines. To train and evaluate at scale, we introduce
(i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii)
HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain
transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the
DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure
skating. On TogetherVideoBench, DanceTogether outperforms prior art by a
significant margin. Moreover, we show that a one-hour fine-tune yields
convincing human-robot videos, underscoring broad generalization to embodied-AI
and HRI tasks. Extensive ablations confirm that persistent identity-action
binding is critical to these gains. Together, our model, datasets, and
benchmark lift CVG from single-subject choreography to compositionally
controllable, multi-actor interaction, opening new avenues for digital
production, simulation, and embodied intelligence. Our video demos and code are
available at https://DanceTog.github.io/.
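
To make the MaskPoseAdapter idea concrete, here is a minimal PyTorch sketch of mask-gated pose fusion. This is an illustration only: the module name matches the paper, but the layer layout, tensor shapes, and parameter names (num_actors, pose_channels, hidden_dim) are assumptions, not the released implementation. The key step is multiplying each actor's noisy pose heat-map by that actor's tracking mask before fusion, so "who" stays bound to "how" at every denoising step.

```python
# Hypothetical sketch of the MaskPoseAdapter idea; shapes and layer
# choices are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class MaskPoseAdapter(nn.Module):
    def __init__(self, num_actors: int, pose_channels: int, hidden_dim: int):
        super().__init__()
        # Fuse the mask-gated pose features of all actors into one
        # control feature map for the denoising backbone.
        self.fuse = nn.Sequential(
            nn.Conv2d(num_actors * pose_channels, hidden_dim, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
        )

    def forward(self, masks: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # masks: (B, A, 1, H, W) per-actor tracking masks ("who")
        # poses: (B, A, C, H, W) per-actor pose heat-maps ("how")
        gated = masks * poses  # bind each pose channel to its actor's region
        b, a, c, h, w = gated.shape
        return self.fuse(gated.reshape(b, a * c, h, w))


# Example: 2 actors, 17-keypoint heat-maps on a 64x64 latent grid.
adapter = MaskPoseAdapter(num_actors=2, pose_channels=17, hidden_dim=128)
ctrl = adapter(torch.rand(1, 2, 1, 64, 64), torch.rand(1, 2, 17, 64, 64))
print(ctrl.shape)  # torch.Size([1, 128, 64, 64])
```

Gating before fusion, rather than concatenating masks and poses as separate inputs, is what prevents one actor's pose signal from leaking into another actor's spatial region, which is the identity-drift failure mode the abstract describes.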