Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Abstract

Although diffusion models have shown a powerful ability to generate photorealistic images, generating realistic and diverse videos remains in its infancy. One key reason is that current methods entangle spatial content and temporal dynamics, which notably increases the complexity of text-to-video (T2V) generation. In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives: the structure level and the content level. At the structure level, we decompose the T2V task into two steps, spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors from text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that express motion and appearance changes, respectively. These two cues then guide the training of the video generation model, enabling flexible content variations and enhancing temporal stability. Through this decoupled paradigm, HiGen effectively reduces the complexity of the task and generates realistic videos with semantic accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over state-of-the-art T2V methods.
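The abstract does not say how the two content-level cues are computed, so the following is only an illustrative sketch under stated assumptions, not HiGen's actual procedure. It assumes, hypothetically, that a motion cue can be approximated by the mean absolute difference between consecutive frames, and that an appearance cue can be approximated by the cosine similarity between each frame's feature vector and that of the first frame. The function names `motion_cue` and `appearance_cue`, the flattened-frame representation, and the feature vectors are all made up for illustration.

```python
import math
from typing import List, Sequence


def motion_cue(frames: Sequence[Sequence[float]]) -> float:
    """Assumed motion proxy: mean absolute difference between consecutive frames.

    Each frame is a flat sequence of pixel values of equal length.
    Returns 0.0 when fewer than two frames are given.
    """
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        for a, b in zip(prev, curr):
            total += abs(b - a)
            count += 1
    return total / count if count else 0.0


def appearance_cue(features: Sequence[Sequence[float]]) -> List[float]:
    """Assumed appearance proxy: cosine similarity of each frame's feature
    vector to the first frame's feature vector."""
    ref = features[0]
    ref_norm = math.sqrt(sum(x * x for x in ref))
    sims = []
    for feat in features:
        norm = math.sqrt(sum(x * x for x in feat))
        dot = sum(a * b for a, b in zip(ref, feat))
        # Guard against division by zero for all-zero feature vectors.
        sims.append(dot / max(ref_norm * norm, 1e-8))
    return sims


# Hypothetical toy inputs: three 2-pixel frames and three 2-D feature vectors.
toy_frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
toy_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
toy_motion = motion_cue(toy_frames)
toy_appearance = appearance_cue(toy_features)
```

Under these assumptions, a static clip yields a motion value of zero, while a frame whose features point in a different direction from the first frame yields a lower appearance similarity. Scalars of this kind could in principle serve as conditioning signals during training, which is the role the abstract attributes to the two cues.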
