DynVFX: Augmenting Real Videos with Dynamic Content

Abstract

We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained Vision Language Model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.
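As a rough illustration of what manipulating features inside an attention layer can look like, the toy sketch below lets queries from the newly generated content attend over keys and values taken from the original video in addition to their own. This is a generic single-head example under stated assumptions, not the method's actual implementation: the function names (`attention`, `extended_attention`) are hypothetical, and the source-video keys and values are assumed to have been extracted beforehand.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def extended_attention(q_new, k_new, v_new, k_src, v_src):
    """Hypothetical helper: queries from the generated stream attend to
    their own keys/values plus keys/values from the original video
    (assumed to be precomputed), so source features can shape the result."""
    return attention(q_new, k_new + k_src, v_new + v_src)
```

Concatenating the source keys and values, rather than overwriting the generated ones, is one simple design choice: the softmax then decides per query how strongly to rely on the original scene. With empty `k_src` and `v_src` the helper reduces to ordinary attention.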