Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
January 17, 2024
Authors: Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun Jeong
cs.AI
Abstract
Addressing the limitations of text as a source of accurate layout
representation in text-conditional diffusion models, many works incorporate
additional signals to condition certain attributes within a generated image.
Although successful, previous works do not account for the specific
localization of these attributes when extended into the three-dimensional plane. In
this context, we present a conditional diffusion model that integrates control
over three-dimensional object placement with disentangled representations of
global stylistic semantics from multiple exemplar images. Specifically, we
first introduce depth disentanglement training to leverage the
relative depth of objects as an estimator, allowing the model to identify the
absolute positions of unseen objects through the use of synthetic image
triplets. We also introduce soft guidance, a method for imposing
global semantics onto targeted regions without the use of any additional
localization cues. Our integrated framework, Compose and Conquer
(CnC), unifies these techniques to localize multiple conditions in a
disentangled manner. We demonstrate that our approach allows perception of
objects at varying depths while offering a versatile framework for composing
localized objects with different global semantics. Code:
https://github.com/tomtom1103/compose-and-conquer/
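
The abstract only gestures at how depth disentanglement training obtains its synthetic image triplets. As a hedged illustration, not the authors' released code, the sketch below shows one plausible way to derive a (foreground, background, composite) triplet from a single image and a relative depth map; the threshold `tau`, the function name `make_image_triplet`, and the mean-color stand-in for a real inpainting model are all assumptions of this sketch.

```python
import numpy as np

def make_image_triplet(image: np.ndarray, depth: np.ndarray, tau: float = 0.5):
    """Split an image into a (foreground, background, composite) triplet
    using its relative depth map.

    image: (H, W, 3) float array in [0, 1]
    depth: (H, W) relative depth in [0, 1] (larger = closer), e.g. from a
           monocular depth estimator
    tau:   assumed depth threshold separating salient foreground pixels
    """
    fg_mask = (depth >= tau)[..., None]           # (H, W, 1) bool, True = foreground
    foreground = image * fg_mask                  # foreground kept, background zeroed
    # The occluded background is unobservable from a single image; a real
    # pipeline would recover it with an off-the-shelf inpainting model.
    # A mean-color fill stands in for that step to keep the sketch
    # self-contained.
    bg_pixels = image[~fg_mask[..., 0]]
    fill = bg_pixels.mean(axis=0) if bg_pixels.size else np.zeros(3)
    background = np.where(fg_mask, fill, image)   # inpainting stand-in
    return foreground, background, image          # source image is the composite
```

With triplets of this form, a model can be supervised to place a foreground object at an absolute position in depth while the background is conditioned independently, which is the role the abstract attributes to depth disentanglement training.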
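Soft guidance is likewise described only at a high level: global semantics are imposed onto targeted regions without additional localization cues. One plausible reading, sketched below under assumptions, is a masked cross-attention step in which exemplar-derived key/value tokens may only influence latent positions inside a target region; the tensor shapes and the name `soft_guided_cross_attention` are illustrative, not taken from the paper.

```python
import torch

def soft_guided_cross_attention(q: torch.Tensor, k: torch.Tensor,
                                v: torch.Tensor, region_mask: torch.Tensor):
    """Cross-attention restricted to a target region.

    q:           (B, N, d)  image latent tokens (N = H*W spatial positions)
    k, v:        (B, M, d)  global-semantic tokens from an exemplar image
    region_mask: (B, N)     bool, True where the exemplar's semantics apply
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale          # (B, N, M) logits
    # Queries outside the target region are masked out, so the exemplar's
    # global semantics are imposed only inside the region.
    attn = attn.masked_fill(~region_mask[..., None], float("-inf"))
    weights = attn.softmax(dim=-1)
    weights = torch.nan_to_num(weights)               # fully masked rows -> 0
    return weights @ v                                # (B, N, d)
```

In a full pipeline, a step like this would presumably replace the standard cross-attention over exemplar embeddings inside the denoising network, which is what lets multiple exemplars contribute different global semantics to different regions in a disentangled manner.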