FlexWorld: Progressively Expanding 3D Scenes for Flexible-View Synthesis

Abstract

Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming. Project page: https://ml-gsai.github.io/FlexWorld.