SeeThrough3D: Occlusion-Aware 3D Control in Text-to-Image Generation

Abstract

We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model inter-object occlusions precisely. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), in which objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from a desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without mixing of object attributes. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
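
To make concrete why translucency can encode hidden regions, the following is a minimal sketch and not the rendering pipeline used for OSCR. It assumes each 3D box has already been projected to an axis-aligned image rectangle with a known camera depth, and it blends the rectangles back to front with a fixed opacity. The function name `composite_translucent_boxes`, the box tuple layout, and the `alpha` value are hypothetical choices made only for this illustration.

```python
import numpy as np


def composite_translucent_boxes(boxes, height, width, alpha=0.5):
    """Blend translucent rectangles back to front onto a white canvas.

    Illustrative assumption: each box is (x0, y0, x1, y1, depth, rgb),
    where the corners are pixel coordinates of an already-projected box,
    depth is the distance from the camera, and rgb lies in [0, 1].
    """
    canvas = np.ones((height, width, 3), dtype=np.float64)
    # Paint the farthest box first so nearer boxes are blended on top of it.
    for x0, y0, x1, y1, _depth, rgb in sorted(boxes, key=lambda b: -b[4]):
        region = canvas[y0:y1, x0:x1]
        # Standard "over" blend with constant opacity: the colour already
        # on the canvas still contributes (1 - alpha) to the result.
        canvas[y0:y1, x0:x1] = (1.0 - alpha) * region + alpha * np.asarray(rgb)
    return canvas


# A far red box partially hidden behind a nearer blue box.
far_red = (0, 0, 4, 4, 5.0, (1.0, 0.0, 0.0))
near_blue = (2, 2, 6, 6, 2.0, (0.0, 0.0, 1.0))
image = composite_translucent_boxes([near_blue, far_red], height=8, width=8)
```

In this toy scene the overlap region ends up with a colour that differs from both the red-only and the blue-only regions, so the extent of the occluded box remains recoverable from the image. An opaque rendering would discard that information.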