

InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

June 11, 2025
Authors: Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Dahua Lin
cs.AI

Abstract

End-to-end human animation with rich multi-modal conditions, e.g., text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts can appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from each modality to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject local audio conditions into their corresponding regions in an iterative manner to ensure layout-aligned modality matching. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
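To make the region-specific binding concrete, below is a minimal PyTorch sketch of the two ideas the abstract describes: a mask predictor that matches video tokens against each reference-appearance embedding to obtain a per-identity soft spatiotemporal mask, and local audio conditioning applied only inside each identity's mask. The module names (`MaskPredictor`, `inject_local_audio`), the dot-product matching, and the additive injection are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Hedged sketch: per-identity mask prediction and layout-aligned audio injection.
# Shapes and operators are assumptions chosen for clarity, not the authors' code.
import torch
import torch.nn as nn


class MaskPredictor(nn.Module):
    """Predicts a soft spatiotemporal mask per identity by matching
    appearance cues between video tokens and a reference embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)

    def forward(self, video_tokens: torch.Tensor, ref_embed: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, D) flattened spatiotemporal tokens of the denoised video
        # ref_embed:    (B, D)    pooled embedding of one reference appearance
        q = self.to_q(video_tokens)              # (B, N, D)
        k = self.to_k(ref_embed).unsqueeze(-1)   # (B, D, 1)
        logits = torch.bmm(q, k).squeeze(-1)     # (B, N) similarity per token
        return logits.sigmoid()                  # soft mask in [0, 1]


def inject_local_audio(video_tokens, audio_feats, masks, audio_proj):
    """Add each identity's audio feature only inside its predicted region."""
    # audio_feats: (B, num_ids, D_audio); masks: list of (B, N) soft masks
    out = video_tokens
    for i, mask in enumerate(masks):
        local = audio_proj(audio_feats[:, i])                     # (B, D)
        out = out + mask.unsqueeze(-1) * local.unsqueeze(1)       # masked injection
    return out


if __name__ == "__main__":
    B, N, D, D_audio, num_ids = 1, 64, 32, 16, 2
    predictor = MaskPredictor(D)
    audio_proj = nn.Linear(D_audio, D)
    video = torch.randn(B, N, D)                 # denoised video tokens
    refs = torch.randn(B, num_ids, D)            # one embedding per reference concept
    audio = torch.randn(B, num_ids, D_audio)     # one audio feature per identity
    masks = [predictor(video, refs[:, i]) for i in range(num_ids)]
    video = inject_local_audio(video, audio, masks, audio_proj)
    print(video.shape)  # torch.Size([1, 64, 32])
```

In an actual diffusion pipeline this masked injection would be repeated at each denoising step, so the predicted layout and the audio conditioning refine each other iteratively, as the abstract outlines.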