InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
June 11, 2025
Authors: Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Dahua Lin
cs.AI
Abstract
End-to-end human animation with rich multi-modal conditions, e.g., text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore limits applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from each modality to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject local audio conditions into their corresponding regions in an iterative manner to ensure layout-aligned modality matching. This design enables high-quality, controllable generation of multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
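For intuition, the following PyTorch-style sketch illustrates one way such layout-aligned audio injection could be realized: each identity's audio tokens are fused into the video latent via cross-attention, and the contribution is gated by that identity's predicted spatio-temporal mask so the audio only affects its own region. All module names, tensor shapes, and the single-head attention are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch (assumed, not the authors' code) of region-specific
# audio-condition binding via mask-gated cross-attention.
import torch
import torch.nn as nn


class MaskedAudioInjection(nn.Module):
    def __init__(self, video_dim: int, audio_dim: int):
        super().__init__()
        self.to_q = nn.Linear(video_dim, video_dim)
        self.to_k = nn.Linear(audio_dim, video_dim)
        self.to_v = nn.Linear(audio_dim, video_dim)
        self.out = nn.Linear(video_dim, video_dim)

    def forward(self, video_tokens, audio_tokens, masks):
        """
        video_tokens: [B, T*H*W, Cv]  flattened video latent tokens
        audio_tokens: [B, N, La, Ca]  per-identity audio embeddings (N identities)
        masks:        [B, N, T*H*W]   soft per-identity layout masks in [0, 1]
        """
        B, L, Cv = video_tokens.shape
        n_ids = audio_tokens.shape[1]
        q = self.to_q(video_tokens)                       # [B, L, Cv]
        fused = torch.zeros_like(video_tokens)
        for i in range(n_ids):
            k = self.to_k(audio_tokens[:, i])             # [B, La, Cv]
            v = self.to_v(audio_tokens[:, i])             # [B, La, Cv]
            attn = torch.softmax(q @ k.transpose(1, 2) / Cv ** 0.5, dim=-1)
            local = self.out(attn @ v)                    # [B, L, Cv]
            # Region-specific binding: identity i's audio only influences
            # tokens inside its own predicted mask.
            fused = fused + masks[:, i].unsqueeze(-1) * local
        return video_tokens + fused


# Toy usage: 2 identities, a 4x8x8 latent video, 16 audio tokens per identity.
if __name__ == "__main__":
    layer = MaskedAudioInjection(video_dim=64, audio_dim=32)
    video = torch.randn(1, 4 * 8 * 8, 64)
    audio = torch.randn(1, 2, 16, 32)
    masks = torch.rand(1, 2, 4 * 8 * 8)
    print(layer(video, audio, masks).shape)  # torch.Size([1, 256, 64])
```

In this sketch the masks play the role of the layout inferred by the paper's mask predictor; in practice they would be predicted from appearance cues matched between the denoised video and each reference image rather than supplied directly.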