CineScale: Free Lunch in High-Resolution Cinematic Visual Generation

Abstract

Visual diffusion models have achieved remarkable progress, yet they are typically trained at limited resolutions because high-resolution data is scarce and computational resources are constrained, which hampers their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle is that when a model generates visual content beyond its training resolution, the inevitable increase in high-frequency information leads to accumulated errors, which in turn produce undesirable repetitive patterns. In this work, we propose CineScale, a novel inference paradigm that enables higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods, which are confined to high-resolution text-to-image (T2I) and text-to-video (T2V) generation, CineScale broadens the scope by enabling high-resolution image-to-video (I2V) and video-to-video (V2V) synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the higher-resolution visual generation capabilities of both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.