DisPose: Disentangling Pose Guidance for Controllable Human Image Animation

Abstract

Controllable human image animation aims to generate videos from reference images using driving videos. Because sparse guidance (e.g., skeleton pose) provides only limited control signals, recent works have attempted to introduce additional dense conditions (e.g., depth maps) to ensure motion alignment. However, such strict dense guidance impairs the quality of the generated video when the body shape of the reference character differs significantly from that of the driving video. In this paper, we present DisPose, which mines more generalizable and effective control signals without additional dense input by disentangling the sparse skeleton pose in human image animation into motion field guidance and keypoint correspondence. Specifically, we generate a dense motion field from a sparse motion field and the reference image, which provides region-level dense guidance while maintaining the generalization of sparse pose control. We also extract diffusion features corresponding to pose keypoints from the reference image, and then transfer these point features to the target pose to provide distinct identity information. To integrate seamlessly into existing models, we propose a plug-and-play hybrid ControlNet that improves the quality and consistency of generated videos while freezing the existing model parameters. Extensive qualitative and quantitative experiments demonstrate the superiority of DisPose compared to current methods. Code: https://github.com/lihxxx/DisPose.
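
The abstract does not say how the sparse motion field is built. As a minimal sketch, assume it stores, at each reference keypoint location, the displacement of that skeleton keypoint from the reference pose to the target pose, and zeros everywhere else. The function name `sparse_motion_field`, the `(x, y)` pixel convention, and the grid layout are hypothetical choices for illustration and are not taken from the DisPose code.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates; an assumed convention


def sparse_motion_field(
    ref_kpts: List[Point],
    tgt_kpts: List[Point],
    height: int,
    width: int,
) -> List[List[Tuple[float, float]]]:
    """Return an H x W grid of (dx, dy) vectors.

    Hypothetical helper: the grid is nonzero only at reference keypoint
    locations, where it holds the keypoint's displacement to the target pose.
    """
    if len(ref_kpts) != len(tgt_kpts):
        raise ValueError("keypoint lists must have the same length")

    # Start from an all-zero field: no motion information away from keypoints.
    field = [[(0.0, 0.0) for _ in range(width)] for _ in range(height)]

    for (rx, ry), (tx, ty) in zip(ref_kpts, tgt_kpts):
        # Ignore keypoints that fall outside the image bounds.
        if 0 <= rx < width and 0 <= ry < height:
            field[ry][rx] = (float(tx - rx), float(ty - ry))
    return field


# Two keypoints on a 4 x 5 grid: one moves right, one moves up.
field = sparse_motion_field([(1, 1), (3, 2)], [(2, 1), (3, 0)], height=4, width=5)
```

A field like this carries motion only at a handful of pixels, which is why the abstract describes propagating it into a dense, region-level motion field conditioned on the reference image.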
