VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis

Abstract

Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the necessary visual change-over-time implied in the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis based on the original single prompt leveraging LLMs, which gives accurate textual guidance to different visual states of longer videos, and 2) Temporal Attention Regularization (TAR) - a regularization technique to refine the temporal attention units of the pre-trained T2V diffusion models, which enables control over the video dynamics. We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models. We additionally analyze the temporal attention maps realized with and without VSTAR, demonstrating the importance of applying our method to mitigate neglect of the desired visual change over time.
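
The abstract does not spell out how TAR refines the temporal attention units. As a rough illustration only, the sketch below assumes one plausible form: an additive bias on the frame-to-frame attention logits, built as a symmetric Toeplitz matrix whose values decay with a Gaussian profile away from the diagonal, so that each frame attends more strongly to its temporal neighbors. The function names, the `sigma` and `scale` parameters, and the placement of the bias before the softmax are all assumptions made for this example, not details taken from the abstract.

```python
import numpy as np


def gaussian_toeplitz_bias(num_frames, sigma, scale=1.0):
    """Hypothetical helper: symmetric Toeplitz matrix whose (i, j) entry
    depends only on |i - j| and decays as a Gaussian away from the diagonal."""
    idx = np.arange(num_frames)
    dist = idx[:, None] - idx[None, :]
    return scale * np.exp(-(dist ** 2) / (2.0 * sigma ** 2))


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def regularized_temporal_attention(q, k, v, sigma, scale=1.0):
    """Temporal attention for one spatial location with an additive bias.

    q, k, v have shape (num_frames, dim). The bias is added to the logits
    before the softmax (an assumption of this sketch).
    """
    num_frames, dim = q.shape
    logits = q @ k.T / np.sqrt(dim)
    logits = logits + gaussian_toeplitz_bias(num_frames, sigma, scale)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights
```

In this sketch a smaller `sigma` concentrates attention on nearby frames, which weakens the coupling between distant frames, while a larger `sigma` flattens the bias toward a constant that the softmax ignores.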
