Auto-Regressively Generating Multi-View Consistent Images

Abstract

Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges are maintaining consistency across views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive (AR) model to progressively generate consistent multi-view images from arbitrary prompts. First, the next-token-prediction capability of the AR model makes it significantly more effective at progressive multi-view synthesis: when generating widely separated views, MV-AR can draw on all preceding views to extract effective reference information. Second, we propose a unified model that accommodates various prompts through architecture design and training strategies. To handle multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, we employ a progressive training strategy that starts from a text-to-multi-view (t2mv) model as a baseline and develops it into a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate the overfitting caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, which expands the training data by several orders of magnitude. Experiments demonstrate the performance and versatility of MV-AR, which reliably generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at https://github.com/MILab-PKU/MVAR.
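
The abstract names two training-time techniques without giving their details: "Shuffle View" augmentation and the random dropping and combining of conditions. The sketch below shows one plausible reading of each, and it is an illustration under stated assumptions rather than the authors' implementation. It assumes that Shuffle View reorders the views of an object (together with their camera poses) within the training sequence, and that each condition is dropped independently with a fixed probability. The helper names `shuffle_view`, `drop_conditions`, and `num_orderings`, and the `keep_prob` parameter, are hypothetical.

```python
import random
from math import factorial


def shuffle_view(views, poses, rng):
    """Return the views and their camera poses in a new random order.

    Assumption: "Shuffle View" permutes the view order of one object,
    keeping each view paired with its own camera pose.
    """
    order = list(range(len(views)))
    rng.shuffle(order)
    return [views[i] for i in order], [poses[i] for i in order]


def num_orderings(num_views):
    """Number of distinct view orderings available for one object."""
    return factorial(num_views)


def drop_conditions(conditions, keep_prob, rng):
    """Keep each condition independently with probability keep_prob.

    Assumption: conditions (text, image, shape, ...) are dropped
    independently; the dropping schedule is not given in the abstract.
    """
    return {
        name: value
        for name, value in conditions.items()
        if rng.random() < keep_prob
    }


if __name__ == "__main__":
    rng = random.Random(0)
    views = ["view_0", "view_1", "view_2", "view_3"]  # placeholder images
    poses = [0, 90, 180, 270]                         # placeholder azimuths
    print(shuffle_view(views, poses, rng))
    print(num_orderings(len(views)))
    print(drop_conditions({"text": "a chair", "image": "ref.png"}, 0.5, rng))
```

Under this reading, an object with n views yields n! training sequences, which is how a permutation-based augmentation could enlarge a small dataset; the actual factor depends on the number of views per object, which the abstract does not specify.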
