Versatile Editing of Video Content, Actions, and Dynamics without Training

Abstract

Controllable video generation has improved dramatically in recent years. However, editing actions and dynamic events in real-world videos, or inserting content that should affect the behavior of other objects, remains a major challenge. Existing trained models struggle with complex edits, likely because relevant training data is difficult to collect. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modifying motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals and is thus model-agnostic. We show that naively adapting this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources of these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.
