

Auto-Regressively Generating Multi-View Consistent Images

June 23, 2025
Authors: JiaKui Hu, Yuxiao Yang, Jialun Liu, Jinbo Wu, Chen Zhao, Yanye Lu
cs.AI

Abstract

Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. First, the next-token-prediction capability of the AR model significantly enhances its effectiveness in progressive multi-view synthesis. When generating widely separated views, MV-AR can draw on all of its preceding views to extract effective reference information. Second, we propose a unified model that accommodates various prompts via architectural design and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, we employ a progressive training strategy: starting from a text-to-multi-view (t2mv) model as a baseline, we develop a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate overfitting caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, expanding the effective training data by several orders of magnitude. Experiments demonstrate the performance and versatility of MV-AR, which reliably generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at https://github.com/MILab-PKU/MVAR.
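The "Shuffle View" idea can be illustrated with a minimal sketch. The assumption (ours, not spelled out in the abstract) is that because the AR model conditions only on whichever views precede the current one, any permutation of an object's view sequence is an equally valid training sample, so N views yield up to N! orderings. The function name `shuffle_view_augment` and the `(image, camera_pose)` pair representation are hypothetical illustrations.

```python
import random


def shuffle_view_augment(sample, num_aug, seed=0):
    """Create extra auto-regressive training sequences by permuting view order.

    sample:  list of (image, camera_pose) pairs for one object
    num_aug: number of shuffled copies to produce
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(num_aug):
        # Each random permutation is a new valid sequence: the model is
        # conditioned on the preceding views, whatever their order.
        perm = sample[:]
        rng.shuffle(perm)
        augmented.append(perm)
    return augmented
```

For example, with 8 views per object there are 8! = 40,320 possible orderings, which is consistent with the claimed expansion of training data by several orders of magnitude.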