Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Abstract

We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, which makes scene-level generation infeasible, or rely on expensive training on large-scale multi-view video datasets, and they generalize poorly because 4D scene data is scarce. In contrast, our key insight is to distill pre-trained foundation models into a consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models and then initialize the 4D geometric structure. 2) To turn this coarse structure into spatially and temporally consistent multi-view videos, we design an adaptive guidance mechanism that combines a point-guided denoising strategy for spatial consistency with a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into a consistent 4D representation, we propose a modulation-based refinement that mitigates inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advance in single-image-based 4D scene generation.
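The abstract does not spell out how the latent replacement strategy works, so the following is only a generic sketch of the pattern the name suggests, not Free4D's actual procedure. It assumes a DDPM-style noise schedule and shows a single denoising step in which selected latent entries are overwritten with a reference latent that has been re-noised to the current step's noise level. The function names (`add_noise`, `replace_latents`), the flat-list latent layout, and the choice of mask are all hypothetical; which regions and which reference frame or view Free4D uses is not stated here.

```python
import math
import random


def add_noise(clean, alpha_bar, rng):
    """Forward-diffuse a clean latent to the noise level given by alpha_bar.

    Assumes the DDPM form: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in clean]


def replace_latents(current, reference, mask, alpha_bar, rng):
    """Overwrite masked entries of `current` with a re-noised `reference`.

    Hypothetical helper: where mask[i] is True, the reference latent
    (noised to the current step's level) replaces the current latent;
    elsewhere the current latent is kept unchanged.
    """
    noised_ref = add_noise(reference, alpha_bar, rng)
    return [r if m else c for c, r, m in zip(current, noised_ref, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    current = [0.0, 0.0, 0.0, 0.0]        # latent being denoised (toy values)
    reference = [1.0, 1.0, 1.0, 1.0]      # clean latent from a reference frame (assumed)
    mask = [True, False, True, False]     # entries to keep consistent (assumed)
    print(replace_latents(current, reference, mask, alpha_bar=0.5, rng=rng))
```

In a real video diffusion pipeline the latents would be tensors and the replacement would run inside the sampler loop at each step (or a subset of steps); the list-based version above only illustrates the masking and re-noising logic.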