Geometry-Editable and Appearance-Preserving Object Composition

Abstract

General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties while preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, enabling both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived networks and reference networks to extract semantic embeddings and appearance-preserving representations, which are then integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve appearance features and spatially align them with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.
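The cross-attention retrieval step described in the abstract can be illustrated with a toy sketch. This is not the DGAD implementation: the function name `cross_attention_retrieve`, the two-dimensional features, and the use of plain scaled dot-product attention without learned projections are all assumptions made for illustration. Queries stand in for locations in the geometry-edited representation, and keys and values stand in for appearance features from a reference network.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_retrieve(queries, keys, values):
    """Scaled dot-product cross-attention (hypothetical helper, not from the paper).

    queries: one vector per location of the geometry-edited representation
    keys:    one vector per reference appearance feature (used for matching)
    values:  one vector per reference appearance feature (what gets retrieved)
    Returns one retrieved appearance vector per query location.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    retrieved = []
    for q in queries:
        # Similarity between this location and every reference feature.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted mix of appearance values, dominated by the best match.
        retrieved.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return retrieved


# Toy data: two edited locations, two reference features.
queries = [[10.0, 0.0], [0.0, 10.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
aligned = cross_attention_retrieve(queries, keys, values)
```

In this toy setup each query matches one key strongly, so each output location receives almost exactly the appearance vector of its matching reference feature. A real model would add learned query, key, and value projections and run this inside the diffusion network's attention layers.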
