CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
August 21, 2025
Authors: Haonan Qiu, Ning Yu, Ziqi Huang, Paul Debevec, Ziwei Liu
cs.AI
Abstract
Visual diffusion models have achieved remarkable progress, yet they are typically
trained at limited resolutions due to the lack of high-resolution data and
constrained computation resources, hampering their ability to generate
high-fidelity images or videos at higher resolutions. Recent efforts have
explored tuning-free strategies to unlock the untapped potential of pre-trained
models for higher-resolution visual generation. However, these
methods are still prone to producing low-quality visual content with repetitive
patterns. The key obstacle is the inevitable increase in high-frequency
information when the model generates visual content exceeding its training
resolution, which causes errors to accumulate and surface as undesirable
repetitive patterns. In this work, we propose CineScale, a novel inference
paradigm to enable higher-resolution visual generation. To tackle the various
issues introduced by the two types of video generation architectures, we
propose dedicated variants tailored to each. Unlike existing baseline methods
that are confined to high-resolution T2I and T2V generation, CineScale broadens
the scope by enabling high-resolution I2V and V2V synthesis, built atop
state-of-the-art open-source video generation frameworks. Extensive experiments
validate the superiority of our paradigm in extending both image and video
models to higher-resolution visual generation.
Remarkably, our approach enables 8k image generation without any fine-tuning,
and achieves 4k video generation with only minimal LoRA fine-tuning. Generated
video samples are available on our website:
https://eyeline-labs.github.io/CineScale/.
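
To make the phrase "high-frequency information" concrete, the following is a minimal sketch of one way to measure how much of an image's spectral energy lies above a chosen spatial frequency. It is not the CineScale method, which the abstract does not specify in detail. The function names, the naive DFT, and the 0.25 cycles-per-pixel cutoff are all assumptions made for illustration.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a list-of-lists image.

    Illustrative only: O(N^4), so suitable just for tiny inputs.
    """
    rows, cols = len(image), len(image[0])
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2.0 * math.pi * (u * y / rows + v * x / cols)
                    acc += image[y][x] * cmath.exp(1j * angle)
            out[u][v] = acc
    return out


def abs_frequency(k, n):
    """Absolute spatial frequency of DFT bin k, in cycles per pixel (0 to 0.5)."""
    return min(k, n - k) / n


def high_freq_energy_fraction(image, cutoff=0.25):
    """Fraction of spectral energy at radial frequency above `cutoff`.

    `cutoff` is a hypothetical threshold in cycles per pixel, not a value
    taken from the paper.
    """
    rows, cols = len(image), len(image[0])
    spectrum = dft2(image)
    total = 0.0
    high = 0.0
    for u in range(rows):
        for v in range(cols):
            power = abs(spectrum[u][v]) ** 2
            radius = math.hypot(abs_frequency(u, rows), abs_frequency(v, cols))
            total += power
            if radius > cutoff:
                high += power
    return high / total if total > 0 else 0.0


if __name__ == "__main__":
    # A flat image has only a DC component; a one-pixel checkerboard sits
    # at the highest representable frequency.
    flat = [[1.0] * 8 for _ in range(8)]
    checker = [[(-1.0) ** (x + y) for x in range(8)] for y in range(8)]
    print("flat:", high_freq_energy_fraction(flat))
    print("checkerboard:", high_freq_energy_fraction(checker))
```

A measure of this kind could be used to compare outputs generated at different resolutions, though whether the authors use any such diagnostic is not stated in the abstract.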