Magic-Me: Identity-Specific Video Customized Diffusion

Abstract

Content generation for a specific identity (ID) has attracted significant interest in the field of generative models. In text-to-image (T2I) generation, subject-driven content generation has made great progress, with the ID in the images controllable. However, extending it to video generation remains underexplored. In this work, we propose a simple yet effective framework for subject-identity-controllable video generation, termed Video Custom Diffusion (VCD). Given a subject ID defined by a few images, VCD reinforces identity information extraction and injects frame-wise correlation at the initialization stage, producing stable video outputs that largely preserve the identity. To achieve this, we propose three novel components that are essential for high-quality ID preservation: 1) an ID module trained on the identity cropped by prompt-to-segmentation, which disentangles the ID information from the background noise for more accurate ID token learning; 2) a text-to-video (T2V) VCD module with a 3D Gaussian Noise Prior for better inter-frame consistency; and 3) video-to-video (V2V) Face VCD and Tiled VCD modules that deblur the face and upscale the video to a higher resolution. Despite VCD's simplicity, extensive experiments verify that it generates stable, high-quality videos with better ID preservation than the selected strong baselines. In addition, because the ID module is transferable, VCD also works well with publicly available finetuned text-to-image models, further improving its usability. The code is available at https://github.com/Zhen-Dong/Magic-Me.
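
The idea of injecting frame-wise correlation at the initialization stage can be illustrated with a minimal sketch. The abstract does not give the exact form of the 3D Gaussian Noise Prior, so the code below assumes a covariance of `gamma ** |m - n|` between frames m and n; the function name `frame_correlated_noise` and the parameter `gamma` are hypothetical and are not taken from the Magic-Me code base.

```python
import numpy as np


def frame_correlated_noise(num_frames, shape, gamma, rng=None):
    """Sample Gaussian noise for each frame with correlation across frames.

    Assumption (not stated in the abstract): the covariance between
    frames m and n is gamma ** |m - n|, with 0 <= gamma < 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(num_frames)
    # Frame-by-frame covariance matrix with ones on the diagonal.
    cov = gamma ** np.abs(idx[:, None] - idx[None, :])
    # Cholesky factor L satisfies cov = L @ L.T.
    chol = np.linalg.cholesky(cov)
    # Independent standard normal noise, one array per frame.
    white = rng.standard_normal((num_frames, *shape))
    # Mix the independent frames along the frame axis only.
    return np.tensordot(chol, white, axes=(1, 0))


# Example: 16 frames of 4x32x32 latent noise with strong temporal correlation.
noise = frame_correlated_noise(16, (4, 32, 32), gamma=0.9)
```

Under this assumption each element keeps unit variance, so the result can stand in for independent initial noise in a diffusion sampler, and `gamma = 0` recovers fully independent frames.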
