DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

Abstract

Recent advances in customized video generation have enabled users to create videos tailored to both specific subjects and motion trajectories. However, existing methods often require complicated test-time fine-tuning and struggle with balancing subject learning and motion control, limiting their real-world applications. In this paper, we present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory, guided by a single image and a bounding box sequence, respectively, and without the need for test-time fine-tuning. Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning, and devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks derived from bounding boxes. While these two components achieve their intended functions, we empirically observe that motion control tends to dominate over subject learning. To address this, we propose two key designs: 1) the masked reference attention, which integrates a blended latent mask modeling scheme into reference attention to enhance subject representations at the desired positions, and 2) a reweighted diffusion loss, which differentiates the contributions of regions inside and outside the bounding boxes to ensure a balance between subject and motion control. Extensive experimental results on a newly curated dataset demonstrate that DreamVideo-2 outperforms state-of-the-art methods in both subject customization and motion control. The dataset, code, and models will be made publicly available.
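
The abstract does not give a formula for the box masks or the reweighted diffusion loss, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes pixel-space boxes in `(x0, y0, x1, y1)` form, a plain mean-squared-error objective, and two hypothetical weights, `w_in` and `w_out`, for the regions inside and outside the boxes. The function names `boxes_to_masks` and `reweighted_mse` are illustrative only.

```python
import numpy as np


def boxes_to_masks(boxes, height, width):
    """Turn one (x0, y0, x1, y1) pixel box per frame into binary masks.

    Returns an array of shape (num_frames, height, width) with 1.0 inside
    each frame's box and 0.0 outside. Box edges are treated as half-open:
    x0 <= x < x1 and y0 <= y < y1.
    """
    masks = np.zeros((len(boxes), height, width), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        masks[t, y0:y1, x0:x1] = 1.0
    return masks


def reweighted_mse(pred, target, masks, w_in=2.0, w_out=1.0):
    """Mean squared error with separate weights inside and outside the boxes.

    w_in and w_out are hypothetical hyperparameters, not values from the paper.
    """
    # Per-pixel weight: w_in where the mask is 1, w_out where it is 0.
    weights = w_out + (w_in - w_out) * masks
    return float(np.mean(weights * (pred - target) ** 2))


# A box that moves one pixel to the right per frame on a 4x4 grid.
boxes = [(0, 1, 2, 3), (1, 1, 3, 3), (2, 1, 4, 3)]
masks = boxes_to_masks(boxes, height=4, width=4)

rng = np.random.default_rng(0)
target = rng.normal(size=masks.shape).astype(np.float32)
pred = target + 0.1  # constant error of 0.1 at every pixel

plain = reweighted_mse(pred, target, masks, w_in=1.0, w_out=1.0)
weighted = reweighted_mse(pred, target, masks, w_in=2.0, w_out=1.0)
print(plain, weighted)
```

In this toy setup each box covers a quarter of its frame, so doubling the inside weight raises the loss by a factor of about 1.25 when the error is uniform. In a real training loop the masks would presumably be resized to the latent resolution and the weights tuned, but the abstract does not specify those details.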
