FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

Abstract

Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models have achieved high-quality results over the last few years, whereas video synthesis methods have only recently begun to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on a text-to-image diffusion model. The first stage synthesizes keyframes that outline the storyline of a video, while the second stage generates interpolation frames that make the movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframe generation. The results show that separate temporal blocks outperform temporal layers, both on metrics reflecting aspects of video generation quality and on human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of a MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores (higher PSNR and SSIM, lower MSE and LPIPS). Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/
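As a reading aid for the reconstruction metrics named in the abstract, the following is a minimal sketch of how MSE and PSNR relate for a pair of frames. It is not the paper's evaluation code: the function names `mse` and `psnr`, the flat-list frame representation, and the 8-bit peak value of 255 are assumptions made for illustration. It shows why the two metrics move in opposite directions: PSNR is a logarithmic function of the inverse of MSE, so a lower MSE yields a higher PSNR.

```python
import math


def mse(frame_a, frame_b):
    """Mean squared error between two equally sized, flat pixel sequences."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and the same size")
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)


def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value=255 assumes 8-bit pixels."""
    err = mse(frame_a, frame_b)
    if err == 0:
        # Identical frames: the ratio is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)
```

For a video, such per-frame values would typically be averaged over all decoded frames; SSIM and LPIPS require a structural model and a learned network respectively and are not sketched here.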
