FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
May 6, 2025
Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang
cs.AI
Abstract
Action customization involves generating videos where the subject performs
actions dictated by input control signals. Current methods use pose-guided or
global motion customization but are limited by strict constraints on spatial
structure, such as layout, skeleton, and viewpoint consistency, reducing
adaptability across diverse subjects and scenarios. To overcome these
limitations, we propose FlexiAct, which transfers actions from a reference
video to an arbitrary target image. Unlike existing methods, FlexiAct allows
for variations in layout, viewpoint, and skeletal structure between the subject
of the reference video and the target image, while maintaining identity
consistency. Achieving this requires precise action control, spatial structure
adaptation, and consistency preservation. To this end, we introduce RefAdapter,
a lightweight image-conditioned adapter that excels in spatial adaptation and
consistency preservation, surpassing existing methods in balancing appearance
consistency and structural flexibility. Additionally, based on our
observations, the denoising process exhibits varying levels of attention to
motion (low frequency) and appearance details (high frequency) at different
timesteps. So we propose FAE (Frequency-aware Action Extraction), which, unlike
existing methods that rely on separate spatial-temporal architectures, directly
achieves action extraction during the denoising process. Experiments
demonstrate that our method effectively transfers actions to subjects with
diverse layouts, skeletons, and viewpoints. We release our code and model
weights to support further research at
https://shiyi-zh0408.github.io/projectpages/FlexiAct/
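
To make the frequency observation concrete, below is a minimal, self-contained PyTorch sketch of the general idea: split a latent into low- and high-frequency bands with an FFT mask and weight them by denoising timestep. The radial cutoff, the linear schedule, and all function names (`frequency_split`, `timestep_weights`) are illustrative assumptions and not the paper's actual FAE implementation.

```python
# Illustrative sketch only: timestep-dependent weighting of low- vs. high-frequency
# latent content, loosely mirroring the observation that early (noisy) denoising
# steps focus on motion and late steps on appearance detail.
import torch


def frequency_split(latent: torch.Tensor, cutoff: float = 0.25):
    """Split (B, C, H, W) latents into low/high-frequency parts via a radial FFT mask."""
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    _, _, h, w = latent.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=latent.device),
        torch.linspace(-1, 1, w, device=latent.device),
        indexing="ij",
    )
    low_mask = (torch.sqrt(yy**2 + xx**2) <= cutoff).to(latent.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * low_mask, dim=(-2, -1))).real
    return low, latent - low  # low-frequency (motion-like), high-frequency (detail-like)


def timestep_weights(t: int, num_steps: int = 1000):
    """Hypothetical linear schedule: high-noise steps emphasize low-frequency motion,
    low-noise steps emphasize high-frequency appearance detail."""
    progress = t / num_steps          # ~1.0 at the start of denoising, ~0.0 at the end
    return progress, 1.0 - progress   # (motion weight, appearance weight)


if __name__ == "__main__":
    latent = torch.randn(1, 4, 64, 64)          # toy video-frame latent
    low, high = frequency_split(latent)
    w_motion, w_appearance = timestep_weights(t=800)
    guided = w_motion * low + w_appearance * high
    print(guided.shape)                         # torch.Size([1, 4, 64, 64])
```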