VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Abstract

Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. In contrast, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to the insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method that elevates the performance of T2V using the superior capabilities of T2I. Unlike conventional T2V sampling (i.e., joint temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses the encapsulated T2V to enhance temporal consistency and then inverts the result to the noise distribution required by T2I. Spatial quality elevating then harnesses the inflated T2I to directly predict a less noisy latent, adding more photo-realistic details. We have conducted experiments on a wide range of prompts with various combinations of T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code is available at https://github.com/YBYBZhang/VideoElevator.
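The two-stage decomposition described in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: every function name below (`temporal_motion_refining`, `spatial_quality_elevating`, `sample`, and the `toy_*` stand-ins) is hypothetical, and the toy callables only mimic the roles of the T2V denoiser, the inversion to the T2I noise level, and the inflated T2I step. For the actual method, see the repository linked above.

```python
from typing import Callable, List, Tuple

# Toy stand-in for video latents: one list of floats per frame.
Latents = List[List[float]]
Step = Callable[[Latents, int], Latents]


def temporal_motion_refining(latents: Latents, t: int,
                             t2v_denoise: Step,
                             invert_to_t2i_noise: Step) -> Latents:
    """Stage 1: improve temporal consistency with T2V, then map the result
    back to the noise level that the T2I model expects."""
    refined = t2v_denoise(latents, t)
    return invert_to_t2i_noise(refined, t)


def spatial_quality_elevating(latents: Latents, t: int,
                              inflated_t2i_step: Step) -> Latents:
    """Stage 2: let the (inflated) T2I model predict a less noisy latent."""
    return inflated_t2i_step(latents, t)


def sample(latents: Latents, timesteps: List[int],
           t2v_denoise: Step, invert_to_t2i_noise: Step,
           inflated_t2i_step: Step) -> Latents:
    """Run both stages once per sampling step, in the order the abstract gives."""
    for t in timesteps:
        latents = temporal_motion_refining(latents, t, t2v_denoise,
                                           invert_to_t2i_noise)
        latents = spatial_quality_elevating(latents, t, inflated_t2i_step)
    return latents


def run_toy_example() -> Tuple[Latents, List[str]]:
    """Exercise the loop with hypothetical stand-ins and record the call order."""
    calls: List[str] = []

    def toy_t2v_denoise(latents: Latents, t: int) -> Latents:
        # Assumption: temporal smoothing (neighbor averaging) stands in for T2V.
        calls.append("t2v")
        n = len(latents)
        out = []
        for i in range(n):
            window = latents[max(0, i - 1):min(n, i + 2)]
            out.append([sum(col) / len(window) for col in zip(*window)])
        return out

    def toy_invert(latents: Latents, t: int) -> Latents:
        # Assumption: identity placeholder for inversion to the T2I noise level.
        calls.append("invert")
        return latents

    def toy_t2i_step(latents: Latents, t: int) -> Latents:
        # Assumption: a per-frame rescale stands in for the inflated T2I step.
        calls.append("t2i")
        return [[0.9 * v for v in frame] for frame in latents]

    start = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # 3 frames, 2 values each
    result = sample(start, [2, 1], toy_t2v_denoise, toy_invert, toy_t2i_step)
    return result, calls
```

Calling `run_toy_example()` returns the final toy latents together with the recorded call order, which makes the per-step alternation between the two stages easy to inspect. Passing the three models in as callables reflects the plug-and-play framing: different T2V and T2I combinations could be swapped in without changing the loop.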
