Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Abstract

In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from a single video input. Direct 4D diffusion modeling is extremely challenging because data construction is costly and the representation is high-dimensional: it must jointly capture 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building on this efficient representation, we train a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.
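
The idea of a canonical set of Gaussians plus per-frame variations can be illustrated with a minimal sketch. It is not the paper's implementation: the class names `CanonicalGS` and `FrameVariation`, the restriction to positions and opacities, and the assumption that variations are simple additive offsets are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class CanonicalGS:
    """Hypothetical container: one static set of Gaussians shared by all frames."""
    positions: List[Vec3]
    opacities: List[float]


@dataclass
class FrameVariation:
    """Hypothetical per-frame offsets relative to the canonical Gaussians."""
    d_positions: List[Vec3]
    d_opacities: List[float]


def apply_variation(canon: CanonicalGS, var: FrameVariation) -> CanonicalGS:
    """Build one frame's Gaussians as canonical + offset (additive form is an assumption)."""
    positions = [
        (p[0] + d[0], p[1] + d[1], p[2] + d[2])
        for p, d in zip(canon.positions, var.d_positions)
    ]
    # Keep opacities inside the valid [0, 1] range after applying the offset.
    opacities = [
        min(1.0, max(0.0, o + d))
        for o, d in zip(canon.opacities, var.d_opacities)
    ]
    return CanonicalGS(positions, opacities)


# Toy data: two Gaussians and two frames. Frame 0 has zero variation.
canon = CanonicalGS(
    positions=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    opacities=[0.5, 1.0],
)
frames = [
    FrameVariation([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)], [0.0, 0.0]),
    FrameVariation([(0.0, 0.5, 0.0), (0.0, 0.25, 0.0)], [0.25, 0.0]),
]
animated = [apply_variation(canon, f) for f in frames]
```

In this toy setup the animation is stored as one canonical set and a sequence of offsets, so only the offsets change over time. This is the kind of structure a compact latent representation of the variations could exploit.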

