BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Abstract

Generating higher-resolution human-centric scenes with fine detail and control remains a challenge for existing text-to-image diffusion models. The challenge stems from limited training image size, limited text encoder capacity (a cap on the number of tokens), and the inherent difficulty of generating complex scenes involving multiple humans. Current methods address only the training size limit, and they often yield human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes these limitations and generates higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene takes a staged, hierarchical approach. It first generates a detailed base image, focusing on the elements crucial to creating multiple human instances and on detailed descriptions beyond the token limit of the diffusion model. It then seamlessly converts the base image into a higher-resolution output that exceeds the training image size and incorporates text-aware and instance-aware details. This conversion uses our novel instance-aware hierarchical enlargement process, which consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in correspondence with detailed text descriptions and in naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models, without costly retraining. Project page: https://janeyeon.github.io/beyond-scene.
