Realiz3D: Photorealistic 3D Generation via Domain-Aware Learning

Abstract

We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by taking an image generator pre-trained on billions of real images and fine-tuning it on renders of synthetic 3D assets, for which control-signal annotations are available. While this approach can learn the desired controls, it often compromises the realism of the images because of the domain gap between photographs and renders. We observe that this issue largely arises because the model learns an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models that decouples controls from the visual domain. The key idea is to learn the visual domain (real or synthetic) explicitly and separately from the other control signals by introducing a covariate that is fed into small residual adapters and shifts the domain. The generator can then be trained to gain controllability without fitting to a specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We further improve the transfer of controls to the real domain by leveraging insights into the roles of different layers and denoising steps in diffusion-based generators, which inform new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D on tasks such as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.
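To make the key idea concrete, the following is a minimal sketch of a residual adapter conditioned on a domain covariate. It is an illustration under assumptions, not the Realiz3D implementation: the abstract does not specify the adapter architecture, so the class name `DomainAdapter`, the low-rank down/up projections, the per-domain embedding, and the zero-initialized up projection are all hypothetical choices made here for clarity.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class DomainAdapter:
    """Hypothetical residual adapter switched by a domain covariate.

    Computes h + up(relu(down(h) + embed[domain])), so the domain label
    ("real" or "synthetic") only influences the small residual branch.
    """

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # Small random down projection: dim -> rank.
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                     for _ in range(rank)]
        # Zero-initialized up projection (assumption): the adapter starts
        # as the identity, leaving the pre-trained features untouched.
        self.up = [[0.0] * rank for _ in range(dim)]
        # One learnable embedding per visual domain (assumption).
        self.domain_embed = {
            "synthetic": [rng.gauss(0.0, 1.0) for _ in range(rank)],
            "real": [rng.gauss(0.0, 1.0) for _ in range(rank)],
        }

    def __call__(self, h, domain):
        z = matvec(self.down, h)
        # Inject the domain covariate, then apply a ReLU nonlinearity.
        z = [max(0.0, zi + ei)
             for zi, ei in zip(z, self.domain_embed[domain])]
        delta = matvec(self.up, z)
        # Residual connection: the adapter only shifts the features.
        return [hi + di for hi, di in zip(h, delta)]


if __name__ == "__main__":
    adapter = DomainAdapter(dim=4, rank=2)
    h = [1.0, -2.0, 0.5, 3.0]
    # Before any training, the zero-initialized branch leaves h unchanged.
    print(adapter(h, "real") == h)
```

In a setup like this, the synthetic renders would be processed with `domain="synthetic"` during fine-tuning and the covariate would be switched to `"real"` at inference time, so that the controls learned elsewhere in the generator are not tied to the rendered appearance. How the actual method trains and places its adapters is not stated in the abstract.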