CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Abstract

Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches are designed for a single subject and struggle to handle multiple subjects, which is a more challenging and more practical scenario. In this work, we aim to advance multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. Specifically, we first encourage the co-occurrence of multiple subjects by composing them in a single image. Then, building on a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of the diffusion model. Moreover, to help the model focus on the specific object area, we segment the object from the given reference images and provide a corresponding object mask for attention learning. We also collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 69 individual subjects and 57 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method over previous state-of-the-art approaches.
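To make two of the ideas in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes images and attention maps are plain nested lists, that attention values and masks lie in [0, 1], and that "attention control" can be approximated by a loss that pulls each subject's attention toward its own mask and penalizes attention that lands inside other subjects' masks. The function names `compose_side_by_side` and `attention_mask_loss`, and the exact form of the loss, are hypothetical and do not come from the abstract.

```python
def compose_side_by_side(images):
    """Join same-height images (lists of pixel rows) into one wider image.

    Hypothetical stand-in for "composing multiple subjects in a single image".
    """
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must have the same height")
    # Concatenate row r of every image, left to right.
    return [sum((img[r] for img in images), []) for r in range(height)]


def attention_mask_loss(attn_maps, masks):
    """Assumed mask-guided attention loss (illustrative only).

    attn_maps, masks: nested lists indexed as [subject][row][col].
    Positive term: squared error between a subject's attention and its own mask.
    Negative term: attention of a subject that falls inside another subject's mask.
    Returns the sum of both terms averaged over all cells.
    """
    num_subjects = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    positive = 0.0
    negative = 0.0
    for s in range(num_subjects):
        for r in range(height):
            for c in range(width):
                a = attn_maps[s][r][c]
                positive += (a - masks[s][r][c]) ** 2
                in_other = any(
                    masks[o][r][c] for o in range(num_subjects) if o != s
                )
                if in_other:
                    negative += a
    cells = num_subjects * height * width
    return (positive + negative) / cells
```

Under these assumptions, attention that exactly matches each subject's mask gives a loss of zero, while attention that drifts onto another subject's region is penalized by both terms, which is one simple way to express the goal of disentangling subjects.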
