Fine-grained Controllable Video Generation via Object Appearance and Context

Abstract

Text-to-video generation has shown promising results. However, when natural language is the only input, users often struggle to provide enough detail to precisely control the model's output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To this end, we propose a unified framework that jointly injects control signals into an existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layers, we adapt the model to generate videos that are aligned with both the text prompt and the fine-grained control. Compared to existing methods that rely on dense control signals such as edge maps, we provide a more intuitive and user-friendly interface that allows object-level fine-grained control. Our method controls object appearance without finetuning, which spares users per-subject optimization. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines.
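The abstract names adaptive cross-attention layers inserted into an existing text-to-video model but does not describe their internals. As a rough illustration only, the sketch below shows one common way such an inserted layer can work: video tokens attend to control tokens, and the result is added back through a gate. The gating scheme, the function names (`cross_attend`, `gated_control_layer`), and the toy vectors are all assumptions made for this example, not details taken from FACTOR; learned query/key/value projections are omitted for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def gated_control_layer(video_tokens, control_tokens, gate):
    """Add gated cross-attention over control tokens to each video token.

    Assumption: tanh(gate) scales the residual, so gate == 0.0 leaves the
    pretrained model's features untouched. This is a hypothetical design,
    not the layer described in the paper.
    """
    g = math.tanh(gate)
    out = []
    for tok in video_tokens:
        ctx = cross_attend(tok, control_tokens, control_tokens)
        out.append([t + g * c for t, c in zip(tok, ctx)])
    return out


# Toy inputs: two video tokens and two control tokens
# (for instance, one control token per controlled object).
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
control_tokens = [[0.5, 0.5], [1.0, -1.0]]

unchanged = gated_control_layer(video_tokens, control_tokens, gate=0.0)
adapted = gated_control_layer(video_tokens, control_tokens, gate=1.0)
```

With the gate at zero the layer is an identity, so a pretrained backbone would behave exactly as before; training only the gate and the control encoder would then gradually let the control tokens influence the video features. Whether FACTOR uses this particular mechanism is not stated in the abstract.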

