CineScale: Free Lunch in High-Resolution Cinematic Visual Generation

Abstract

Visual diffusion models have achieved remarkable progress, yet they are typically trained at limited resolutions because high-resolution data is scarce and computational resources are constrained. This hampers their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle is the inevitable increase in high-frequency information when a model generates visual content beyond its training resolution, which leads to undesirable repetitive patterns that stem from accumulated errors. In this work, we propose CineScale, a novel inference paradigm that enables higher-resolution visual generation. To tackle the distinct issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods, which are confined to high-resolution T2I (Text-to-Image) and T2V (Text-to-Video) generation, CineScale broadens the scope by enabling high-resolution I2V (Image-to-Video) and V2V (Video-to-Video) synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8K image generation without any fine-tuning and achieves 4K video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.