Auto-Regressively Generating Multi-View Consistent Images
June 23, 2025
Authors: JiaKui Hu, Yuxiao Yang, Jialun Liu, Jinbo Wu, Chen Zhao, Yanye Lu
cs.AI
Abstract
Generating multi-view images from human instructions is crucial for 3D
content creation. The primary challenges involve maintaining consistency across
multiple views and effectively synthesizing shapes and textures under diverse
conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR)
method, which leverages an auto-regressive model to progressively generate
consistent multi-view images from arbitrary prompts. First, the
next-token-prediction capability of the AR model significantly enhances its
effectiveness for progressive multi-view synthesis. When generating widely
separated views, MV-AR can draw on all previously generated views to extract
effective reference information. Second, we propose a unified model that
accommodates various prompts via dedicated architecture designs and training
strategies. To handle multiple conditions, we introduce condition injection
modules for text, camera pose, image, and shape. To manage multi-modal
conditions simultaneously, a progressive training strategy is employed. This
strategy first trains a text-to-multi-view (t2mv) model as a baseline, then
develops a comprehensive X-to-multi-view (X2mv) model by randomly dropping
and combining conditions. Finally, to alleviate the
overfitting problem caused by limited high-quality data, we propose the
"Shuffle View" data augmentation technique, thus significantly expanding the
training data by several magnitudes. Experiments demonstrate the performance
and versatility of our MV-AR, which consistently generates consistent
multi-view images across a range of conditions and performs on par with leading
diffusion-based multi-view image generation models. Code and models will be
released at https://github.com/MILab-PKU/MVAR.
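
The random drop-and-combine step of the progressive training strategy lends itself to a compact illustration. The following is a minimal sketch under assumed details, not the released implementation: the modality names, the uniform drop probability, and the fallback rule are all illustrative choices rather than the paper's actual hyperparameters.

```python
import random

def sample_condition_subset(conditions, drop_prob=0.5):
    """Sketch of randomly dropping and combining conditions per step.

    `conditions` maps modality names ("text", "camera_pose", "image",
    "shape") to their condition inputs. Each modality is dropped
    independently with probability `drop_prob`; if every modality gets
    dropped, one is kept at random so the model always receives at
    least one prompt.
    """
    kept = {k: v for k, v in conditions.items() if random.random() >= drop_prob}
    if not kept:
        name = random.choice(list(conditions))
        kept = {name: conditions[name]}
    return kept

# Over many training steps the model is exposed to all subsets of
# conditions, growing the t2mv baseline into a general X2mv model.
conds = {"text": "a red chair", "camera_pose": "pose_emb",
         "image": "ref_img_emb", "shape": "shape_emb"}
print(sorted(sample_condition_subset(conds)))
```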
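The "Shuffle View" augmentation can be sketched in a similarly compact way. This is an assumption-laden illustration of the idea stated in the abstract: because the AR model conditions each view on all preceding views, permuting the view order of one sample yields many valid training sequences. The data layout (`(image, camera_pose)` pairs) and the helper name are hypothetical.

```python
import itertools

def shuffle_view_orderings(views, max_orderings=None):
    """Sketch of "Shuffle View": enumerate view orderings of a sample.

    `views` is a list of (image, camera_pose) pairs for one object.
    Any permutation of the view order is a valid AR training sequence,
    so n views expand into up to n! sequences; `max_orderings` caps
    the enumeration for large n.
    """
    orderings = itertools.permutations(views)
    if max_orderings is not None:
        orderings = itertools.islice(orderings, max_orderings)
    return [list(o) for o in orderings]

# Usage: a 4-view sample yields 4! = 24 training orderings,
# which is how the effective data grows by orders of magnitude.
sample = [(f"img_{i}", f"pose_{i}") for i in range(4)]
print(len(shuffle_view_orderings(sample)))  # 24
```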