OcclusionFormer: Arranging Z-Order for Layout-Based Image Generation

Abstract

Recent layout-to-image models have made remarkable progress in spatial controllability, but they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, so generation in the intersection regions is inherently ambiguous and complex occlusion relationships are hard to determine. As a result, these methods often produce entangled textures or physically inconsistent layering in the overlapping areas. To address this issue, we first construct SA-Z, a large-scale dataset with explicit occlusion ordering and pixel-level annotations. Building on this dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.
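
To make the idea of compositing decoupled instances by Z-order concrete, the following is a minimal sketch of generic front-to-back, volume-rendering-style compositing. It is not the OcclusionFormer implementation: the function name `composite_by_z_order`, the per-instance opacity maps, and the convention that a smaller Z value means closer to the viewer are all assumptions made for illustration. Each instance contributes its features weighted by its own opacity times the transmittance left over by the instances in front of it.

```python
import numpy as np


def composite_by_z_order(features, alphas, z_order):
    """Front-to-back compositing of per-instance maps (illustrative sketch).

    features: (N, H, W, C) per-instance feature or color maps
    alphas:   (N, H, W) per-instance opacity in [0, 1]
    z_order:  length-N sequence; ASSUMED convention: smaller = closer to viewer
    """
    features = np.asarray(features, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    order = np.argsort(z_order, kind="stable")  # nearest instance first

    out = np.zeros(features.shape[1:])
    transmittance = np.ones(alphas.shape[1:])  # fraction of light not yet blocked
    for i in order:
        weight = transmittance * alphas[i]      # visible contribution of instance i
        out += weight[..., None] * features[i]
        transmittance *= 1.0 - alphas[i]        # instance i occludes what lies behind
    return out, transmittance


# Two overlapping boxes on a 4x4 canvas: a red box and a blue box.
H = W = 4
red = np.zeros((H, W, 3)); red[..., 0] = 1.0
blue = np.zeros((H, W, 3)); blue[..., 2] = 1.0
alpha_red = np.zeros((H, W)); alpha_red[0:3, 0:3] = 1.0
alpha_blue = np.zeros((H, W)); alpha_blue[1:4, 1:4] = 1.0

# Red in front of blue, then the reverse ordering.
red_front, remaining = composite_by_z_order(
    [red, blue], [alpha_red, alpha_blue], z_order=[0, 1]
)
blue_front, _ = composite_by_z_order(
    [red, blue], [alpha_red, alpha_blue], z_order=[1, 0]
)
```

In this toy setup the two boxes overlap on rows and columns 1 to 2. With the same boxes and opacities, only the Z-order decides which instance owns the overlap, which is the ambiguity that a layout without occlusion information leaves unresolved.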