FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

May 6, 2025
Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang
cs.AI

Abstract

Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, we observe that the denoising process attends to motion (low-frequency) and appearance details (high-frequency) to varying degrees at different timesteps. Motivated by this, we propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, achieves action extraction directly during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/.
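
To make the two components more concrete, below is a minimal sketch of what a lightweight image-conditioned adapter in the spirit of RefAdapter could look like: a single cross-attention layer that injects reference-image tokens into the backbone's hidden states through a zero-initialized residual projection. The module name, dimensions, and initialization choice are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ImageConditionAdapter(nn.Module):
    """Hypothetical lightweight image-conditioned adapter (illustrative only,
    not the paper's RefAdapter). Reference-image tokens are injected into the
    backbone hidden states via cross-attention; the output projection is
    zero-initialized so the adapter starts out as an identity mapping."""

    def __init__(self, hidden_dim: int, image_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.to_kv = nn.Linear(image_dim, hidden_dim * 2)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.out = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out.weight)  # zero init: adapter has no effect at start of training
        nn.init.zeros_(self.out.bias)

    def forward(self, hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, N, hidden_dim) backbone states; image_tokens: (B, M, image_dim)
        k, v = self.to_kv(image_tokens).chunk(2, dim=-1)
        attn_out, _ = self.attn(self.norm(hidden), k, v)
        return hidden + self.out(attn_out)  # residual injection of image conditioning
```

Similarly, the frequency-aware intuition behind FAE can be illustrated with a toy decomposition of a diffusion latent into low-frequency (motion) and high-frequency (appearance detail) components, blended by a timestep-dependent weight. The FFT-based split, the linear schedule, and all names are assumptions for illustration; the paper extracts actions directly inside the denoising process rather than via a post-hoc filter like this.

```python
import torch

def split_frequencies(latent: torch.Tensor, cutoff: float = 0.25):
    """Split a (B, C, H, W) latent into low-/high-frequency parts using a
    hard radial low-pass mask in the 2D Fourier domain (illustrative only)."""
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    _, _, h, w = latent.shape
    yy = torch.linspace(-1, 1, h).view(h, 1).expand(h, w)
    xx = torch.linspace(-1, 1, w).view(1, w).expand(h, w)
    mask = ((xx**2 + yy**2).sqrt() <= cutoff).to(latent.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    return low, latent - low

def frequency_weight(t: int, num_steps: int = 1000) -> float:
    """Toy linear schedule: early (high-noise) timesteps weight low frequencies
    (coarse motion) more; later timesteps weight high frequencies (details)."""
    return t / num_steps

# Usage: blend frequency bands according to the current denoising timestep.
latent = torch.randn(1, 4, 64, 64)  # dummy diffusion latent
low, high = split_frequencies(latent)
w = frequency_weight(t=800)
guided = w * low + (1.0 - w) * high  # motion-dominated early, detail-dominated late
```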
