Realiz3D: Photorealistic 3D Generation via Domain-Aware Learning

Abstract

We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, on renders of synthetic 3D assets, for which control-signal annotations are available. While this approach can learn the desired controls, it often compromises image realism because of the domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models that decouples controls from the visual domain. The key idea is to explicitly learn the visual domain, real or synthetic, separately from other control signals by introducing a covariate that, fed into small residual adapters, shifts the domain. The generator can then be trained to gain controllability without fitting to a specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We enhance control transferability to the real domain by leveraging insights about the roles of different layers and denoising steps in diffusion-based generators, which inform new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks such as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.
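
To make the idea of a domain covariate feeding small residual adapters more concrete, the following is a minimal sketch in plain Python. The abstract does not specify the adapter architecture, so everything here is an assumption for illustration and not the authors' implementation: the bottleneck layout, the class name `DomainResidualAdapter`, the labels `REAL` and `SYNTHETIC`, and the zero-initialised up-projection (which makes the adapter start as an identity map) are all hypothetical.

```python
import random

# Hypothetical domain labels; the abstract only says "real or synthetic".
REAL, SYNTHETIC = 0, 1


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class DomainResidualAdapter:
    """Assumed bottleneck adapter: h + up(relu(down(h) + domain_embedding))."""

    def __init__(self, dim, bottleneck, num_domains=2, seed=0):
        rng = random.Random(seed)
        # Down-projection from the feature size to a small bottleneck.
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # One learned vector per visual domain: this plays the covariate role.
        self.domain_emb = [[rng.gauss(0.0, 0.1) for _ in range(bottleneck)]
                           for _ in range(num_domains)]
        # Zero-initialised up-projection, so the untrained adapter leaves
        # the pre-trained generator's features unchanged.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h, domain):
        z = matvec(self.down, h)
        # Inject the domain covariate in the bottleneck, then apply ReLU.
        z = [max(0.0, a + b) for a, b in zip(z, self.domain_emb[domain])]
        delta = matvec(self.up, z)
        # Residual connection: the adapter only adds a domain-dependent shift.
        return [a + b for a, b in zip(h, delta)]


if __name__ == "__main__":
    adapter = DomainResidualAdapter(dim=4, bottleneck=2)
    features = [1.0, -2.0, 0.5, 3.0]
    # With the up-projection at zero, both domains return the input features.
    print(adapter(features, REAL))
    print(adapter(features, SYNTHETIC))
```

Under these assumptions, training on synthetic renders would use `domain=SYNTHETIC` together with the control signals, while inference would pass `domain=REAL` with the same controls, so that the domain shift is carried by the adapter rather than by the control pathway.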