FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

Abstract

Action customization involves generating videos in which a subject performs actions dictated by input control signals. Current methods rely on pose guidance or global motion customization, but they are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, which reduces their adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels at spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. In addition, we observe that the denoising process attends to motion (low frequency) and appearance details (high frequency) to different degrees at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, extracts actions directly during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/
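To make the low-frequency versus high-frequency distinction concrete, here is a minimal sketch. It is not the authors' FAE implementation. It assumes that "frequency" can be illustrated along the time axis of a clip, splits a per-frame signal into a slow (motion-like) part and a fast (detail-like) part with NumPy's FFT, and defines a hypothetical schedule that puts more weight on the slow part at noisier timesteps. The function names, the cutoff, and the direction of the schedule are all assumptions made for illustration; the abstract states only that the emphasis varies across timesteps.

```python
import numpy as np


def split_temporal_frequencies(frames: np.ndarray, cutoff: int):
    """Split a (T, ...) signal into low- and high-frequency parts along time.

    `cutoff` is the number of lowest rFFT bins (including DC) kept in the
    low-frequency part. Hypothetical helper, for illustration only.
    """
    spectrum = np.fft.rfft(frames, axis=0)
    low_spectrum = np.zeros_like(spectrum)
    low_spectrum[:cutoff] = spectrum[:cutoff]  # keep DC plus the slowest bins
    low = np.fft.irfft(low_spectrum, n=frames.shape[0], axis=0)
    high = frames - low  # the residual carries the faster variation
    return low, high


def motion_weight(t: int, num_steps: int, bias: float = 1.0) -> float:
    """Hypothetical schedule: weight grows linearly with the timestep index.

    Assumption (not from the abstract): larger t means a noisier sample,
    and motion is emphasised more there.
    """
    return 1.0 + bias * (t / (num_steps - 1))


if __name__ == "__main__":
    num_frames = 16
    time = np.arange(num_frames)
    slow = np.sin(2 * np.pi * time / num_frames)            # one cycle per clip
    fast = 0.2 * np.sin(2 * np.pi * 6 * time / num_frames)  # six cycles per clip
    signal = (slow + fast)[:, None]                         # shape (T, 1)

    low, high = split_temporal_frequencies(signal, cutoff=3)
    print("low part matches slow component:", np.allclose(low[:, 0], slow))
    print("high part matches fast component:", np.allclose(high[:, 0], fast))
    print("weight at t=0 and t=49 of 50:", motion_weight(0, 50), motion_weight(49, 50))
```

The split is exact by construction, since the high-frequency part is defined as the residual. A real system would apply any such weighting inside the model's attention or embedding pathway, which this sketch does not attempt to show.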
