Target-Aware Video Diffusion Models

Abstract

We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask, and the desired action is described by a text prompt. Unlike existing controllable image-to-video diffusion models, which often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target and leverages the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level action planning in applications such as robotics. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model on our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.
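The cross-attention loss described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the loss is taken to be a mean squared error, the attention map is the softmax weight each spatial query places on the special target token, the mask is a flattened binary list at the same resolution, and the names `cross_attention_loss`, `selective_loss`, `tgt_index`, and `selected_blocks` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def target_attention_map(scores, tgt_index):
    """scores[q][t] is the raw attention logit between spatial query q and
    text token t. Returns, for each query, the softmax weight assigned to
    the (hypothetical) special target token at position tgt_index."""
    return [softmax(row)[tgt_index] for row in scores]


def cross_attention_loss(scores, mask, tgt_index):
    """Assumed form of the loss: mean squared error between the target
    token's attention map and a flattened binary target mask."""
    attn = target_attention_map(scores, tgt_index)
    if len(attn) != len(mask):
        raise ValueError("attention map and mask must have the same length")
    return sum((a - m) ** 2 for a, m in zip(attn, mask)) / len(mask)


def selective_loss(scores_per_block, mask, tgt_index, selected_blocks):
    """Average the loss over a chosen subset of transformer blocks only,
    mirroring the idea of applying the loss selectively."""
    losses = [
        cross_attention_loss(scores_per_block[b], mask, tgt_index)
        for b in selected_blocks
    ]
    return sum(losses) / len(losses)


# Toy example: a 2x2 latent (4 queries), 3 text tokens, target token last.
mask = [1, 0, 0, 0]
aligned = [[0, 0, 5], [0, 0, -5], [0, 0, -5], [0, 0, -5]]
misaligned = [[0, 0, -5], [0, 0, -5], [0, 0, -5], [0, 0, 5]]

loss_aligned = cross_attention_loss(aligned, mask, tgt_index=2)
loss_misaligned = cross_attention_loss(misaligned, mask, tgt_index=2)
loss_selected = selective_loss([aligned, misaligned], mask, 2, selected_blocks=[0])
```

In this toy setting the loss is small when the target token's attention concentrates inside the mask and large when it concentrates elsewhere, which is the alignment behavior the fine-tuning objective is meant to encourage. Which blocks and attention regions are actually selected is not stated in the abstract, so `selected_blocks` is left as a free parameter.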

