TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Abstract

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models still lag behind image models in visual quality and in user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text prompt, our method generates a high-quality video that adheres to the target text while preserving the spatial layout and motion of the input video. Our method is based on the key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, which are readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/
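The idea of propagating features along inter-frame correspondences can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how correspondences are computed, so nearest-neighbor matching by cosine similarity is an assumption made here for illustration, and the function names (`cosine`, `nearest_neighbors`, `propagate`) and the tiny 2-D "features" are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbors(frame_feats, key_feats):
    """For each token of a source frame, return the index of the most
    similar token in the source keyframe (assumed matching rule)."""
    return [
        max(range(len(key_feats)), key=lambda j: cosine(f, key_feats[j]))
        for f in frame_feats
    ]


def propagate(correspondence, edited_key_feats):
    """Build the frame's edited features by copying, for every token,
    the edited keyframe feature it corresponds to."""
    return [list(edited_key_feats[j]) for j in correspondence]


# Toy data: 2 keyframe tokens, 3 tokens in a neighboring frame.
key_src = [[1.0, 0.0], [0.0, 1.0]]                 # source keyframe features
frame_src = [[0.1, 0.9], [0.9, 0.2], [0.8, 0.1]]   # source neighboring-frame features
key_edited = [[5.0, 5.0], [7.0, 7.0]]              # keyframe features after editing

corr = nearest_neighbors(frame_src, key_src)
frame_edited = propagate(corr, key_edited)
```

Because the correspondences are computed on the source features and then reused to copy edited features, tokens that matched in the original video receive the same edited feature, which is the kind of feature-space consistency the abstract describes.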
