Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

Abstract

To address the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of these attributes extended into the three-dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training, which leverages the relative depth of objects as an estimator and allows the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/

