SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Abstract

We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), in which objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
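
The masked self-attention binding described above can be illustrated with a small sketch. The abstract does not specify the token layout or the exact masking rule, so everything here is an assumption made for illustration: the function name `build_binding_mask`, the layout (text tokens first, then image tokens in row-major order), the use of 2D grid-cell boxes, and the rule that an object's text tokens and the image tokens inside its box may attend to each other.

```python
# Illustrative sketch only: the token layout and masking rule are assumptions,
# not the SeeThrough3D implementation.

def build_binding_mask(text_spans, boxes, grid_w, grid_h):
    """Return allow[q][k]: may query token q attend to key token k?

    text_spans: one (start, end) index range per object description
                (end exclusive), covering text tokens 0..n_text-1.
    boxes:      one (x0, y0, x1, y1) box per object in grid cells
                (x1 and y1 exclusive).
    Image tokens follow the text tokens in row-major order.
    """
    n_text = max(end for _, end in text_spans)
    n = n_text + grid_w * grid_h
    allow = [[False] * n for _ in range(n)]

    # Image tokens attend to every image token (global visual context).
    for q in range(n_text, n):
        for k in range(n_text, n):
            allow[q][k] = True

    for (start, end), (x0, y0, x1, y1) in zip(text_spans, boxes):
        # Indices of the image tokens that fall inside this object's box.
        inside = [n_text + y * grid_w + x
                  for y in range(y0, y1) for x in range(x0, x1)]
        for t in range(start, end):
            # Text tokens of one object see only their own description ...
            for k in range(start, end):
                allow[t][k] = True
            # ... and are tied in both directions to the cells of that box.
            for p in inside:
                allow[t][p] = True
                allow[p][t] = True
    return allow


# Two hypothetical objects on a 4x2 grid whose boxes overlap in column 2.
mask = build_binding_mask(
    text_spans=[(0, 2), (2, 4)],
    boxes=[(0, 0, 3, 2), (2, 0, 4, 2)],
    grid_w=4, grid_h=2,
)
```

In this toy layout, image tokens in the overlapping column attend to both descriptions. A plain 2D mask of this kind cannot say which object is in front there, which is the ambiguity the translucent-box rendering is meant to resolve.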
