Realiz3D: Photorealistic 3D Generation via Domain-Aware Learning

Abstract

We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, on renders of synthetic 3D assets, for which control-signal annotations are available. While this approach can learn the desired controls, it often compromises the realism of the images because of the domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models that decouples the controls from the visual domain. The key idea is to learn the visual domain (real or synthetic) explicitly and separately from the other control signals by introducing a covariate that is fed into small residual adapters and shifts the domain. The generator can then be trained to gain controllability without fitting to a specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We further improve the transfer of controls to the real domain by leveraging insights about the roles of different layers and denoising steps in diffusion-based generators; these insights inform new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks such as text-to-multiview generation and texturing from 3D inputs, where it produces outputs that are both 3D-consistent and photorealistic.
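
The abstract does not spell out the adapter architecture, so the following is only a minimal sketch of the general idea of a small residual adapter driven by a domain covariate. Everything in it is an assumption made for illustration: the names `DomainResidualAdapter`, `SYNTHETIC`, and `REAL`, the low-rank bottleneck, the zero-initialised up-projection, and the use of NumPy in place of a real diffusion backbone. It should not be read as the authors' implementation.

```python
import numpy as np

# Hypothetical encoding of the domain covariate (not taken from the paper).
SYNTHETIC, REAL = 0, 1


class DomainResidualAdapter:
    """Illustrative low-rank residual adapter conditioned on a domain covariate."""

    def __init__(self, dim, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection into a small bottleneck.
        self.down = rng.normal(0.0, 0.02, size=(dim, rank))
        # Zero-initialised up-projection: the adapter starts as the identity,
        # so the pre-trained features are untouched before any training.
        self.up = np.zeros((rank, dim))
        # One learned offset per domain; this is where the covariate enters.
        self.domain_emb = rng.normal(0.0, 0.02, size=(2, rank))

    def __call__(self, h, domain):
        # Bottleneck activation shifted by the embedding of the chosen domain.
        z = np.tanh(h @ self.down + self.domain_emb[domain])
        # Residual connection: the adapter only adds a correction to h.
        return h + z @ self.up


# Usage: the same features pass through the adapter under either domain value.
features = np.random.default_rng(1).normal(size=(3, 8))
adapter = DomainResidualAdapter(dim=8)
out_real = adapter(features, REAL)
out_synthetic = adapter(features, SYNTHETIC)
```

In this sketch the domain only influences the output through `domain_emb`, so switching the covariate from `SYNTHETIC` to `REAL` at inference changes the adapter's correction while any separately learned control pathway is left as it is. That separation is the property the abstract describes, under the assumptions stated above.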