Geometry-Editable and Appearance-Preserving Object Composition
May 27, 2025
Authors: Jianman Lin, Haojie Li, Chunmei Qing, Zhijing Yang, Liang Lin, Tianshui Chen
cs.AI
Abstract
General object composition (GOC) aims to seamlessly integrate a target object
into a background scene with desired geometric properties, while simultaneously
preserving its fine-grained appearance details. Recent approaches derive
semantic embeddings and integrate them into advanced diffusion models to enable
geometry-editable generation. However, these highly compact embeddings encode
only high-level semantic cues and inevitably discard fine-grained appearance
details. We introduce a Disentangled Geometry-editable and
Appearance-preserving Diffusion (DGAD) model that first leverages semantic
embeddings to implicitly capture the desired geometric transformations and then
employs a cross-attention retrieval mechanism to align fine-grained appearance
features with the geometry-edited representation, facilitating both precise
geometry editing and faithful appearance preservation in object composition.
Specifically, DGAD builds on CLIP/DINO-derived and reference networks to
extract semantic embeddings and appearance-preserving representations, which
are then seamlessly integrated into the encoding and decoding pipelines in a
disentangled manner. We first integrate the semantic embeddings into
pre-trained diffusion models that exhibit strong spatial reasoning capabilities
to implicitly capture object geometry, thereby facilitating flexible object
manipulation and ensuring effective editability. Then, we design a dense
cross-attention mechanism that leverages the implicitly learned object geometry
to retrieve and spatially align appearance features with their corresponding
regions, ensuring faithful appearance consistency. Extensive experiments on
public benchmarks demonstrate the effectiveness of the proposed DGAD framework.
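To make the retrieval step described above concrete, here is a minimal sketch of a dense cross-attention block in which the geometry-edited latent queries the reference network's appearance features and fuses the retrieved result back in. This is an illustrative assumption based only on the abstract, not the authors' released implementation; the module name `DenseCrossAttentionRetrieval`, the tensor shapes, and the residual fusion are hypothetical choices.

```python
# Illustrative sketch (not the DGAD authors' code): dense cross-attention that
# uses the geometry-edited diffusion latent as queries to retrieve spatially
# aligned appearance features from a reference-network feature map.
import torch
import torch.nn as nn


class DenseCrossAttentionRetrieval(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, geom_feat: torch.Tensor, app_feat: torch.Tensor) -> torch.Tensor:
        # geom_feat: (B, C, H, W) geometry-edited latent from the diffusion backbone
        # app_feat:  (B, C, H, W) appearance-preserving features from the reference network
        B, C, H, W = geom_feat.shape
        q = geom_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) queries
        kv = app_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) keys/values
        # Each spatial query retrieves the appearance features that best match
        # the implicitly learned object geometry at that location.
        out, _ = self.attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv))
        out = out.transpose(1, 2).reshape(B, C, H, W)
        # Residual fusion keeps the geometry-edited structure while injecting
        # the retrieved fine-grained appearance details.
        return geom_feat + out
```

In a full pipeline, a block like this would presumably sit in the decoding path of the diffusion U-Net at each resolution, with the reference features projected to the matching channel width; those placement details are assumptions here, not claims about the paper.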