Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

Abstract

Leveraging large-scale image-text datasets and advances in diffusion models, text-driven generative models have made remarkable strides in image generation and editing. This study explores the potential of extending this text-driven ability to the generation and editing of multi-text conditioned long videos. Current methods for video generation and editing, while innovative, are often confined to extremely short videos (typically fewer than 24 frames) and are limited to a single text condition. These constraints significantly limit their applications, given that real-world videos usually consist of multiple segments, each bearing different semantic information. To address this challenge, we introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models to generate and edit videos comprising hundreds of frames with diverse semantic segments, without introducing additional training and while preserving content consistency. We have implemented three mainstream text-driven video generation and editing methods and, using the proposed paradigm, extended them to accommodate longer videos with a variety of semantic segments. Our experimental results show that our approach significantly broadens the generative and editing capabilities of video diffusion models, offering new possibilities for future research and applications. The code is available at https://github.com/G-U-N/Gen-L-Video.
