OcclusionFormer: Z-Order Alignment for Layout-to-Image Generation

Abstract

Recent layout-to-image models have made remarkable progress in spatial controllability, but they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods have no explicit occlusion information, so generation in the intersection regions is inherently ambiguous and complex occlusion relationships are hard to resolve. As a result, these methods often produce entangled textures or physically inconsistent layering in the overlapping areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building on this dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.
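
To make the idea of Z-ordered compositing concrete, the following is a minimal sketch of standard front-to-back alpha compositing, the accumulation rule commonly used in volume rendering. It is not taken from OcclusionFormer itself: the function name `composite_front_to_back`, its arguments, and the single-pixel, single-channel setting are all assumptions made for illustration, and the abstract does not specify how the model computes opacities or where in the network compositing happens.

```python
from typing import List, Sequence, Tuple


def composite_front_to_back(
    values: Sequence[float],
    alphas: Sequence[float],
    z_order: Sequence[int],
) -> Tuple[float, List[float]]:
    """Composite per-instance values at one pixel, nearest instance first.

    Hypothetical helper for illustration only.
    values[i]  : value (e.g. one feature channel) of instance i at this pixel
    alphas[i]  : opacity of instance i at this pixel, in [0, 1]
    z_order[i] : depth rank of instance i; a smaller rank is closer to the viewer
    Returns the composited value and the weight assigned to each instance.
    """
    # Visit instances from front (smallest rank) to back.
    order = sorted(range(len(values)), key=lambda i: z_order[i])

    transmittance = 1.0  # fraction of the pixel not yet covered by nearer instances
    weights = [0.0] * len(values)
    out = 0.0
    for i in order:
        # An instance contributes only through what nearer instances leave visible.
        weights[i] = transmittance * alphas[i]
        out += weights[i] * values[i]
        transmittance *= 1.0 - alphas[i]
    return out, weights


# Two overlapping instances; instance 1 has the smaller rank, so it is in front.
value, weights = composite_front_to_back([0.2, 0.9], [1.0, 1.0], [1, 0])
print(value, weights)
```

In this sketch a fully opaque front instance receives all of the weight at the overlapping pixel and the occluded instance receives none, which is the behavior an explicit Z-order is meant to enforce in intersection regions.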