MV-Adapter: Multi-view Consistent Image Generation Made Easy

Abstract

Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning. This leads to (1) high computational costs, especially with large base models and high-resolution images, and (2) degraded image quality caused by optimization difficulties and the scarcity of high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating the risk of overfitting. To efficiently model 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and a parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models while modeling new 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL) and demonstrates adaptability and versatility. It can also be extended to arbitrary-view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation and opens up new possibilities through its efficiency, adaptability, and versatility.
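The idea of duplicated self-attention layers arranged in parallel can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the names `Attention` and `parallel_block`, the single-head layout, the full attention across all views, and the zero-initialized output projection are all assumptions made for illustration. The sketch only shows how a copied attention layer can be added next to a frozen one so that the block initially behaves exactly like the pre-trained model.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class Attention:
    """Toy single-head self-attention (hypothetical stand-in for a T2I layer)."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        self.wo = rng.normal(0.0, scale, (dim, dim))

    def copy(self):
        """Duplicate the layer so the new branch starts from the same weights."""
        clone = Attention.__new__(Attention)
        clone.wq, clone.wk = self.wq.copy(), self.wk.copy()
        clone.wv, clone.wo = self.wv.copy(), self.wo.copy()
        return clone

    def __call__(self, x):
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        weights = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1]))
        return (weights @ v) @ self.wo


def parallel_block(x, self_attn, mv_attn):
    """x has shape (views, tokens, dim).

    The frozen branch attends within each view. The duplicated branch attends
    over the tokens of all views at once (an assumed, simplified form of
    multi-view attention). Both outputs are added to the same residual input,
    so the new branch runs in parallel instead of altering the original path.
    """
    n_views, n_tokens, dim = x.shape
    spatial = self_attn(x)
    all_views = x.reshape(1, n_views * n_tokens, dim)
    cross_view = mv_attn(all_views).reshape(x.shape)
    return x + spatial + cross_view


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))   # 4 views, 6 tokens each, 8 channels
self_attn = Attention(8, rng)    # stands in for a frozen pre-trained layer
mv_attn = self_attn.copy()       # duplicated layer inherits its weights
mv_attn.wo[:] = 0.0              # assumption: zero-init so the adapter starts as a no-op

base = x + self_attn(x)          # output of the original block
out = parallel_block(x, self_attn, mv_attn)
print(np.allclose(out, base))    # the two match at initialization
```

Under these assumptions, only the weights of `mv_attn` would be updated during training, which is one way to read the abstract's point about updating fewer parameters while leaving the original feature space untouched.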
