VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis

Abstract

Despite tremendous progress in text-to-video (T2V) synthesis, open-source T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the visual change over time implied by the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), which alters the generative process on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP), which uses LLMs to automatically generate a video synopsis from the original single prompt, giving accurate textual guidance for the different visual states of longer videos; and 2) Temporal Attention Regularization (TAR), a regularization technique that refines the temporal attention units of pre-trained T2V diffusion models and enables control over the video dynamics. We experimentally showcase the superiority of the proposed approach over existing open-source T2V models in generating longer, visually appealing videos. We additionally analyze the temporal attention maps obtained with and without VSTAR, demonstrating the importance of applying our method to mitigate neglect of the desired visual change over time.
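To make the idea of regularizing temporal attention at inference time more concrete, the following is a minimal NumPy sketch of adding a bias matrix to the frame-to-frame attention logits before the softmax. The abstract does not specify the form of the regularizer, so the Gaussian-banded bias used here, along with the names `banded_bias`, `regularized_temporal_attention`, `sigma`, and `strength`, are assumptions made purely for illustration and should not be read as the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def banded_bias(num_frames, sigma, strength):
    """Hypothetical regularizer (an assumption, not taken from the abstract):
    a symmetric matrix whose entries decay with the distance between frames."""
    idx = np.arange(num_frames)
    dist = idx[:, None] - idx[None, :]
    return strength * np.exp(-(dist ** 2) / (2.0 * sigma ** 2))


def regularized_temporal_attention(q, k, v, bias):
    """Scaled dot-product attention across frames with an additive bias.

    q, k, v: arrays of shape (num_frames, dim) for one spatial location.
    bias:    array of shape (num_frames, num_frames) added to the logits.
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(logits + bias, axis=-1)
    return weights @ v, weights


# Toy comparison: the same queries/keys with and without the bias.
rng = np.random.default_rng(0)
num_frames, dim = 8, 4
q = rng.normal(size=(num_frames, dim))
k = rng.normal(size=(num_frames, dim))
v = rng.normal(size=(num_frames, dim))

bias = banded_bias(num_frames, sigma=1.0, strength=5.0)
_, w_plain = regularized_temporal_attention(q, k, v, np.zeros((num_frames, num_frames)))
_, w_reg = regularized_temporal_attention(q, k, v, bias)
```

In this toy setup the bias is largest on the diagonal, so each frame's attention weight on itself can only grow relative to the unbiased case, which shifts the attention map toward a band structure. Because the change is applied to the logits at inference time, no retraining of the underlying model is involved in the sketch.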
