Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

Abstract

Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three-dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/
