DynVFX: Augmenting Real Videos with Dynamic Content
February 5, 2025
Authors: Danah Yatim, Rafail Fridman, Omer Bar-Tal, Tali Dekel
cs.AI
Abstract
We present a method for augmenting real-world videos with newly generated
dynamic content. Given an input video and a simple user-provided text
instruction describing the desired content, our method synthesizes dynamic
objects or complex scene effects that naturally interact with the existing
scene over time. The position, appearance, and motion of the new content are
seamlessly integrated into the original footage while accounting for camera
motion, occlusions, and interactions with other dynamic objects in the scene,
resulting in a cohesive and realistic output video. We achieve this via a
zero-shot, training-free framework that harnesses a pre-trained text-to-video
diffusion transformer to synthesize the new content and a pre-trained Vision
Language Model to envision the augmented scene in detail. Specifically, we
introduce a novel inference-based method that manipulates features within the
attention mechanism, enabling accurate localization and seamless integration of
the new content while preserving the integrity of the original scene. Our
method is fully automated, requiring only a simple user instruction. We
demonstrate its effectiveness on a wide range of edits applied to real-world
videos, encompassing diverse objects and scenarios involving both camera and
object motion.
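The abstract states that new content is localized and integrated by manipulating features inside the attention mechanism of the text-to-video diffusion transformer. As a rough illustration of what such a manipulation can look like, here is a minimal, self-contained PyTorch sketch in which the generation pass attends over keys and values taken from a pass over the original video in addition to its own. The function name, tensor shapes, and the exact injection rule are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch: extending the attention of the edited (generation) pass with
# keys/values computed from the original video, so generated tokens stay
# anchored to the existing scene. All names and shapes are assumptions.

import torch

def extended_attention(q_edit, k_edit, v_edit, k_orig, v_orig):
    """Attend jointly over edit-pass and original-video keys/values."""
    k = torch.cat([k_edit, k_orig], dim=1)  # (B, N_edit + N_orig, D)
    v = torch.cat([v_edit, v_orig], dim=1)
    scale = q_edit.shape[-1] ** -0.5
    attn = torch.softmax(q_edit @ k.transpose(1, 2) * scale, dim=-1)
    return attn @ v                          # (B, N_edit, D)

# Random features standing in for diffusion-transformer tokens.
B, N, D = 1, 16, 64
q_edit, k_edit, v_edit = (torch.randn(B, N, D) for _ in range(3))
k_orig, v_orig = (torch.randn(B, N, D) for _ in range(2))
out = extended_attention(q_edit, k_edit, v_edit, k_orig, v_orig)
print(out.shape)  # torch.Size([1, 16, 64])
```

Letting the generation pass read the original video's keys and values is one simple way attention-level feature manipulation can preserve the source scene while a text prompt drives the newly added dynamic content.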