FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
May 6, 2025
Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang
cs.AI
Abstract
Action customization involves generating videos where the subject performs
actions dictated by input control signals. Current methods use pose-guided or
global motion customization but are limited by strict constraints on spatial
structure, such as layout, skeleton, and viewpoint consistency, reducing
adaptability across diverse subjects and scenarios. To overcome these
limitations, we propose FlexiAct, which transfers actions from a reference
video to an arbitrary target image. Unlike existing methods, FlexiAct allows
for variations in layout, viewpoint, and skeletal structure between the subject
of the reference video and the target image, while maintaining identity
consistency. Achieving this requires precise action control, spatial structure
adaptation, and consistency preservation. To this end, we introduce RefAdapter,
a lightweight image-conditioned adapter that excels in spatial adaptation and
consistency preservation, surpassing existing methods in balancing appearance
consistency and structural flexibility. Additionally, we observe that the
denoising process attends to motion (low-frequency) and appearance details
(high-frequency) to varying degrees at different timesteps. Therefore, we
propose FAE (Frequency-aware Action Extraction), which, unlike
existing methods that rely on separate spatial-temporal architectures, directly
achieves action extraction during the denoising process. Experiments
demonstrate that our method effectively transfers actions to subjects with
diverse layouts, skeletons, and viewpoints. We release our code and model
weights to support further research at
https://shiyi-zh0408.github.io/projectpages/FlexiAct/.
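
The abstract describes RefAdapter only as a lightweight image-conditioned adapter that injects the target image while preserving appearance consistency; it does not specify the architecture. The sketch below is a minimal, hypothetical illustration of one common way such an adapter can be built: reference-image tokens attended to via a residual cross-attention branch with a zero-initialized output projection. All class and parameter names here are assumptions for illustration, not the released implementation.

```python
# Hypothetical sketch of a lightweight image-conditioned adapter (RefAdapter-style).
# Architecture, dimensions, and initialization are illustrative assumptions.
import torch
import torch.nn as nn

class ImageConditionAdapter(nn.Module):
    """Injects reference-image tokens into a (frozen) backbone via cross-attention."""

    def __init__(self, hidden_dim: int = 1024, image_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.to_tokens = nn.Linear(image_dim, hidden_dim)   # project image encoder features
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out_proj.weight)                # start as an identity residual,
        nn.init.zeros_(self.out_proj.bias)                  # so the frozen backbone is unchanged at init

    def forward(self, hidden_states: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, N, hidden_dim) backbone tokens
        # image_features: (B, M, image_dim) tokens from a reference-image encoder
        ref_tokens = self.to_tokens(image_features)
        attended, _ = self.cross_attn(hidden_states, ref_tokens, ref_tokens)
        return hidden_states + self.out_proj(attended)      # residual injection of image condition
```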
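The FAE component builds on the observation that different denoising timesteps attend to low-frequency (motion) versus high-frequency (appearance) content to different degrees. The abstract does not give the extraction mechanism, so the sketch below only illustrates the underlying idea under a common assumption: a frequency split of the latent via FFT and a simple timestep-dependent weighting that favors low frequencies at noisier timesteps. The cutoff, schedule, and function names are hypothetical.

```python
# Hypothetical illustration of frequency-aware weighting across denoising timesteps.
# Not the paper's FAE implementation; the split and schedule are assumptions.
import torch

def split_frequencies(latent: torch.Tensor, cutoff: float = 0.25):
    """Split a (B, C, H, W) latent into low- and high-frequency parts via a 2D FFT mask."""
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    _, _, H, W = latent.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    low_mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(latent.dtype)  # keep central (low) frequencies
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * low_mask, dim=(-2, -1))).real
    return low, latent - low

def frequency_aware_mix(latent: torch.Tensor, t: int, num_timesteps: int = 1000) -> torch.Tensor:
    """Weight low-frequency (motion) content more at noisy timesteps,
    high-frequency (appearance) content more near the end of denoising."""
    low, high = split_frequencies(latent)
    w_low = t / num_timesteps          # assumed linear schedule: large t = noisy step
    return w_low * low + (1.0 - w_low) * high
```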