Versatile Editing of Video Content, Actions, and Dynamics without Training
March 18, 2026
Authors: Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli
cs.AI
Abstract
Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events in real-world videos, or inserting content that should affect the behavior of other objects, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals and is thus model-agnostic. We show that naively adapting this approach to general, unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources of these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.