4DNeX: Feed-Forward 4D Generative Modeling Made Easy

Abstract

We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches; 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry; and 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.
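
To make the idea of a unified 6D video representation more concrete, the following is a minimal sketch of how paired RGB and XYZ sequences might be stacked into a single six-channel video and then read back as a per-frame colored point cloud. The abstract does not specify tensor layouts or normalization, so the `(T, H, W, C)` layout and the helper names `make_6d_video` and `to_dynamic_point_cloud` are assumptions made purely for illustration, not the authors' implementation.

```python
import numpy as np


def make_6d_video(rgb: np.ndarray, xyz: np.ndarray) -> np.ndarray:
    """Stack an RGB video and a per-pixel XYZ video along the channel axis.

    Assumed layout (not from the paper): both inputs are (T, H, W, 3).
    Returns a float32 array of shape (T, H, W, 6).
    """
    if rgb.shape != xyz.shape or rgb.ndim != 4 or rgb.shape[-1] != 3:
        raise ValueError("rgb and xyz must both have shape (T, H, W, 3)")
    return np.concatenate(
        [rgb.astype(np.float32), xyz.astype(np.float32)], axis=-1
    )


def to_dynamic_point_cloud(video6d: np.ndarray):
    """Split a (T, H, W, 6) video into one colored point cloud per frame.

    Returns a list of (points, colors) pairs, each of shape (H*W, 3).
    """
    clouds = []
    for frame in video6d:
        colors = frame[..., :3].reshape(-1, 3)  # channels 0..2: RGB
        points = frame[..., 3:].reshape(-1, 3)  # channels 3..5: XYZ
        clouds.append((points, colors))
    return clouds


# Usage with random placeholder data (2 frames of 4x4 pixels).
rng = np.random.default_rng(0)
rgb_demo = rng.random((2, 4, 4, 3))
xyz_demo = rng.random((2, 4, 4, 3))
video_demo = make_6d_video(rgb_demo, xyz_demo)
clouds_demo = to_dynamic_point_cloud(video_demo)
print(video_demo.shape)         # (2, 4, 4, 6)
print(clouds_demo[0][0].shape)  # (16, 3)
```

Keeping appearance and geometry pixel-aligned in one array means every generated pixel carries both a color and a 3D position, which is what allows a sequence of frames to be interpreted directly as a dynamic point cloud under these assumptions.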
