OcclusionFormer: Arranging Z-Order for Layout-Based Image Generation

Abstract

Recent layout-to-image models have made remarkable progress in spatial controllability, but they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods have no explicit occlusion information, so generation in the intersecting regions is inherently ambiguous and complex occlusion relationships are hard to resolve. As a result, these methods often produce entangled textures or physically inconsistent layering in the overlapping areas. To address this issue, we first construct SA-Z, a large-scale dataset annotated with explicit occlusion ordering and pixel-level labels. Building on this dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we propose a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.
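
As an illustration of what Z-order compositing via volume rendering can look like, the sketch below blends per-instance features at a single pixel using standard front-to-back alpha compositing. The abstract does not give the exact formulation used in OcclusionFormer, so this is only an assumed, simplified form: the function name `composite_by_z_order`, its parameters, and the per-instance opacity values are all hypothetical.

```python
from typing import List, Sequence, Tuple


def composite_by_z_order(
    features: Sequence[Sequence[float]],
    alphas: Sequence[float],
    z_order: Sequence[int],
) -> Tuple[List[float], List[float]]:
    """Front-to-back compositing of per-instance features at one pixel.

    features[i]: feature vector of instance i (hypothetical per-instance output)
    alphas[i]:   opacity of instance i at this pixel, in [0, 1]
    z_order[i]:  depth rank of instance i; a smaller value is closer to the viewer
    Returns the blended feature and the weight each instance contributed.
    """
    dim = len(features[0])
    blended = [0.0] * dim
    weights = [0.0] * len(features)
    transmittance = 1.0  # fraction of the pixel not yet covered by nearer instances

    # Visit instances from nearest to farthest.
    for i in sorted(range(len(features)), key=lambda k: z_order[k]):
        w = transmittance * alphas[i]  # standard volume-rendering weight
        weights[i] = w
        for d in range(dim):
            blended[d] += w * features[i][d]
        transmittance *= 1.0 - alphas[i]  # nearer instances hide farther ones

    return blended, weights


# Two overlapping instances; swapping the Z-order swaps which one is visible.
feats = [[1.0, 0.0], [0.0, 1.0]]
front_first, _ = composite_by_z_order(feats, [1.0, 1.0], [0, 1])
back_first, _ = composite_by_z_order(feats, [1.0, 1.0], [1, 0])
```

With fully opaque instances, the nearer one takes the whole pixel, which is the behavior an explicit occlusion order is meant to enforce. Fractional opacities give a soft blend, which keeps the operation differentiable.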