CineScale: Free Lunch in High-Resolution Cinematic Visual Generation

Abstract

Visual diffusion models have achieved remarkable progress, yet they are typically trained at limited resolutions because high-resolution data is scarce and computational resources are constrained. This hampers their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle is the inevitable increase in high-frequency information when a model generates visual content beyond its training resolution, which leads to undesirable repetitive patterns arising from accumulated errors. In this work, we propose CineScale, a novel inference paradigm that enables higher-resolution visual generation. To tackle the distinct issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods, which are confined to high-resolution T2I (text-to-image) and T2V (text-to-video) generation, CineScale broadens the scope by enabling high-resolution I2V (image-to-video) and V2V (video-to-video) synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the higher-resolution visual generation capabilities of both image and video models. Remarkably, our approach enables 8K image generation without any fine-tuning and achieves 4K video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.