
Geometry-Editable and Appearance-Preserving Object Composition

May 27, 2025
Authors: Jianman Lin, Haojie Li, Chunmei Qing, Zhijing Yang, Liang Lin, Tianshui Chen
cs.AI

Abstract

General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, facilitating both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived and reference networks to extract semantic embeddings and appearance-preserving representations, which are then seamlessly integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve and spatially align appearance features with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.
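The dense cross-attention retrieval step described above lends itself to a short illustration. Below is a minimal PyTorch sketch of the idea: geometry-edited diffusion features act as queries that retrieve spatially aligned appearance tokens produced by a reference network. All names here (DenseCrossAttention, geometry_feats, appearance_feats) are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of dense cross-attention retrieval (illustrative, not the
# official DGAD code): geometry-edited latent features query appearance
# features from a reference network, so appearance details land in the
# regions dictated by the implicitly learned object geometry.
import torch
import torch.nn as nn

class DenseCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, geometry_feats: torch.Tensor,
                appearance_feats: torch.Tensor) -> torch.Tensor:
        # geometry_feats:   (B, N_latent, C) geometry-edited features (queries)
        # appearance_feats: (B, N_ref, C)    reference appearance tokens (keys/values)
        q = self.norm_q(geometry_feats)
        kv = self.norm_kv(appearance_feats)
        retrieved, _ = self.attn(q, kv, kv)
        # Residual connection preserves the learned geometry while injecting
        # the retrieved, spatially aligned appearance details.
        return geometry_feats + retrieved

# Usage: fuse appearance into a flattened 32x32 latent map with 320 channels.
B, C, H, W = 2, 320, 32, 32
geo = torch.randn(B, H * W, C)   # flattened diffusion U-Net features
ref = torch.randn(B, H * W, C)   # flattened reference-network features
fused = DenseCrossAttention(C)(geo, ref)
print(fused.shape)  # torch.Size([2, 1024, 320])
```

The residual design reflects the disentanglement the abstract emphasizes: the query path carries geometry, the key/value path carries appearance, and attention serves only as a retrieval-and-alignment operator between the two.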