FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Abstract

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. The framework is built on three key components: 1) a new video autoencoder with a highly compressed latent space (64×64×4 spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; 2) a diffusion transformer (DIT) architecture with a new layer memory design that enhances inter-layer information flow and context reuse within the DIT; and 3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, which consists of a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We discuss our model design and training strategies in this report.
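
To give a rough sense of what the stated downsampling ratio means for the size of the latent grid, the following is a minimal arithmetic sketch. It assumes the ratio reads as 64× along height, 64× along width, and 4× along time. The clip size (80 frames at 1024×1024) and the 8×8×4 comparison ratio are hypothetical values chosen for illustration and do not come from the report. The sketch ignores details such as special handling of the first frame, latent channel count, and any patchification inside the DIT.

```python
import math


def latent_grid(frames, height, width, ds_t, ds_h, ds_w):
    """Return the (T, H, W) latent grid size, rounding partial blocks up."""
    return (
        math.ceil(frames / ds_t),
        math.ceil(height / ds_h),
        math.ceil(width / ds_w),
    )


def latent_positions(grid):
    """Number of spatio-temporal positions in a latent grid."""
    t, h, w = grid
    return t * h * w


# Hypothetical input clip (not taken from the report).
frames, height, width = 80, 1024, 1024

# Ratio stated in the abstract, read as 64 (height) x 64 (width) x 4 (time).
high = latent_grid(frames, height, width, ds_t=4, ds_h=64, ds_w=64)

# Assumed comparison ratio of 8 x 8 x 4, for illustration only.
base = latent_grid(frames, height, width, ds_t=4, ds_h=8, ds_w=8)

reduction = latent_positions(base) / latent_positions(high)

print("64x64x4 latent grid:", high, "positions:", latent_positions(high))
print("8x8x4 latent grid:", base, "positions:", latent_positions(base))
print("reduction in positions:", reduction)
```

Under these assumptions, the more aggressive spatial compression shrinks the number of latent positions by a factor of 64 relative to the comparison ratio. Because attention cost grows with sequence length, a smaller latent grid is one plausible contributor to the reported speedup, although the abstract does not break the speedup down by component.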
