Auto-Regressively Generating Multi-View Consistent Images

Abstract

Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges are maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive (AR) model to progressively generate consistent multi-view images from arbitrary prompts. First, the next-token-prediction capability of the AR model makes it well suited to progressive multi-view synthesis: when generating widely separated views, MV-AR can draw on all preceding views to extract effective reference information. Second, we propose a unified model that accommodates various prompts through architecture design and training strategies. To handle multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, we employ a progressive training strategy that starts from a text-to-multi-view (t2mv) model as a baseline and develops it into a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate the overfitting caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, which expands the training data by several magnitudes. Experiments demonstrate the performance and versatility of MV-AR, which generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at https://github.com/MILab-PKU/MVAR.
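The abstract does not say how "Shuffle View" is implemented. As a rough illustration only, the sketch below assumes it means permuting the order in which the views of one object are presented to the AR model, with each image kept paired with its own camera pose. The `View` type and the names `shuffle_views`, `flatten_for_ar`, and `count_orderings` are hypothetical and are not taken from the MV-AR code base.

```python
import math
import random
from typing import List, Sequence, Tuple

# Hypothetical representation: one view = (image token ids, (azimuth, elevation)).
View = Tuple[List[int], Tuple[float, float]]


def shuffle_views(views: Sequence[View], rng: random.Random) -> List[View]:
    """Return the views in a random order.

    Assumption: shuffling permutes whole views, so every image stays
    paired with its own camera pose.
    """
    order = list(range(len(views)))
    rng.shuffle(order)
    return [views[i] for i in order]


def flatten_for_ar(views: Sequence[View]) -> List[int]:
    """Concatenate per-view tokens into one sequence.

    Under next-token prediction, the tokens of a later view are then
    conditioned on the tokens of all preceding views.
    """
    return [tok for tokens, _pose in views for tok in tokens]


def count_orderings(num_views: int) -> int:
    """Number of distinct view orderings available for one object."""
    return math.factorial(num_views)


if __name__ == "__main__":
    rng = random.Random(0)
    # Four toy views, two tokens each, at azimuths 0, 90, 180, 270 degrees.
    views = [([i, i + 10], (i * 90.0, 0.0)) for i in range(4)]
    shuffled = shuffle_views(views, rng)
    print([pose for _tokens, pose in shuffled])
    print(flatten_for_ar(shuffled))
    print(count_orderings(len(views)))
```

Under this assumption, an object with n views yields up to n! distinct training sequences, which is one way a small multi-view dataset could be expanded without new renders.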
