Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

January 17, 2024
Authors: Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun Jeong
cs.AI

Abstract

Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of these attributes extended into the three-dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/
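
The abstract names two mechanisms, depth disentanglement training and soft guidance, without implementation detail. The sketch below is a minimal, hypothetical PyTorch rendering of both ideas for intuition only: `inpaint_fn`, `depth_fn`, the region mask, and all tensor shapes are placeholder assumptions rather than the authors' pipeline, which lives in the linked repository.

```python
# Hypothetical sketches of the two techniques named in the abstract.
# inpaint_fn and depth_fn stand in for any off-the-shelf inpainting model
# and monocular depth estimator; names and shapes are illustrative only.
import torch


def build_synthetic_triplet(image, fg_mask, inpaint_fn, depth_fn):
    """Synthetic image triplet used (per the abstract) by depth
    disentanglement training: the source image, its salient foreground,
    and an inpainted background with that object removed.

    image:   (3, H, W) float tensor
    fg_mask: (H, W) bool tensor marking the salient object
    """
    foreground = image * fg_mask              # keep only the salient object
    background = inpaint_fn(image, fg_mask)   # plausible scene behind it
    # Relative depth of the full scene vs. the emptied scene gives the
    # model a signal for where the object sits along the depth axis.
    return (image, foreground, background), (depth_fn(image), depth_fn(background))


def soft_guided_attention(q, k, v, region_mask):
    """Cross-attention from latent queries q to exemplar-image embeddings
    (k, v), with queries outside a target region cut off so the exemplar's
    global semantics land only in that region: one plausible reading of
    the 'soft guidance' described in the abstract.

    q: (B, Nq, d), k and v: (B, Nk, d), region_mask: (B, Nq) bool.
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, Nq, Nk)
    scores = scores.masked_fill(~region_mask.unsqueeze(-1), float("-inf"))
    attn = torch.nan_to_num(scores.softmax(dim=-1))         # masked rows -> 0
    return attn @ v                                         # (B, Nq, d)
```

In the full model the region mask would presumably be derived from the layout and depth conditions themselves; it is passed in explicitly here only to keep the sketch self-contained.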