VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Abstract

Video generation has attracted growing interest in both academia and industry. Although commercial tools can generate plausible videos, few open-source models are available to researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation: a text-to-video (T2V) model and an image-to-video (I2V) model. T2V models synthesize a video from a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic, cinematic-quality videos at a resolution of 1024 × 576, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to technological advancement within the community.
