TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Abstract

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models still lag behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text prompt, our method generates a high-quality video that adheres to the target text while preserving the spatial layout and motion of the input video. Our method is based on a key observation: consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, which are readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/
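
The idea of propagating features along inter-frame correspondences can be illustrated with a toy sketch. The abstract does not say how correspondences are computed or how features are combined, so everything below is an assumption made for illustration: correspondences are taken as nearest neighbors under cosine similarity between per-token features of the source video, and the hypothetical helpers `nearest_neighbors` and `propagate` simply copy the edited keyframe feature to each matched token. It is not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_neighbors(frame_tokens, key_tokens):
    """For each frame token, return the index of the most similar keyframe token.

    Assumption: correspondences are computed on features of the SOURCE video.
    """
    return [
        max(range(len(key_tokens)), key=lambda j: cosine(tok, key_tokens[j]))
        for tok in frame_tokens
    ]

def propagate(frame_tokens, key_tokens, edited_key_tokens):
    """Copy edited keyframe features to a frame along the source correspondences."""
    nn = nearest_neighbors(frame_tokens, key_tokens)
    return [list(edited_key_tokens[j]) for j in nn]

# Toy 2-D features: one keyframe and one other frame of the source video.
key_src = [[1.0, 0.0], [0.0, 1.0]]
frame_src = [[0.1, 0.9], [0.9, 0.1], [0.2, 0.8]]

# Hypothetical features of the keyframe after it has been edited.
key_edited = [[5.0, 5.0], [7.0, 7.0]]

frame_edited = propagate(frame_src, key_src, key_edited)
print(frame_edited)  # [[7.0, 7.0], [5.0, 5.0], [7.0, 7.0]]
```

In this toy setting, tokens that matched the same keyframe token in the source video receive the same edited feature, which is the sense in which consistency in feature space carries over to the edited frames.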
