OcclusionFormer: Arranging Z-Order for Layout-Guided Image Generation

Abstract

Recent layout-to-image models have made remarkable progress in spatial controllability, yet they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, so generation in the intersection regions is inherently ambiguous and complex occlusion relationships are hard to resolve. As a result, these methods often produce entangled textures or physically inconsistent layering in the overlapping regions. To address this issue, we first construct SA-Z, a large-scale dataset with explicit occlusion ordering and pixel-level annotations. Building on this dataset, we introduce OcclusionFormer, an occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we propose a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.
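
To make the idea of compositing decoupled instances by Z-order concrete, the following is a minimal sketch of the standard front-to-back alpha compositing rule used in volume rendering. It is an illustration under stated assumptions, not the OcclusionFormer implementation: the abstract does not specify the formulation, and the function name composite_by_z_order, the array shapes, and the use of per-instance opacity maps are all hypothetical choices made for this example.

```python
import numpy as np


def composite_by_z_order(features, alphas, z_order):
    """Front-to-back compositing of per-instance maps (illustrative only).

    features: (N, H, W, C) per-instance feature or color maps
    alphas:   (N, H, W) per-instance opacity in [0, 1]
    z_order:  sequence of instance indices, nearest instance first
    """
    features = np.asarray(features, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    out = np.zeros(features.shape[1:])
    # Transmittance: fraction of each pixel not yet blocked by nearer instances.
    transmittance = np.ones(alphas.shape[1:])
    for i in z_order:
        weight = transmittance * alphas[i]
        out += weight[..., None] * features[i]
        transmittance = transmittance * (1.0 - alphas[i])
    return out, transmittance


# Two opaque instances on a 1x3 canvas that overlap in the middle pixel.
features = np.stack([np.full((1, 3, 1), 1.0), np.full((1, 3, 1), 2.0)])
alphas = np.array([[[1.0, 1.0, 0.0]], [[0.0, 1.0, 1.0]]])

front_is_0, _ = composite_by_z_order(features, alphas, z_order=[0, 1])
front_is_1, _ = composite_by_z_order(features, alphas, z_order=[1, 0])
print(front_is_0[0, :, 0])  # middle pixel takes instance 0's value
print(front_is_1[0, :, 0])  # middle pixel takes instance 1's value
```

With the same layout and the same instance content, only the ordering changes the overlapping pixel, which is the ambiguity that explicit occlusion ordering is meant to remove.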