3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Abstract

Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
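
To make the 1-frame optimization idea concrete, the following Python sketch shows one way multi-view reference images could be packaged as single-frame clips, so that every training sample has a temporal length of 1 and therefore carries no motion signal. It is an illustration under stated assumptions, not the authors' implementation: the `Clip` type, the `views_to_one_frame_clips` helper, and the placeholder prompt are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class Clip:
    """A video sample: an ordered list of frames plus a text prompt (hypothetical type)."""
    frames: List[Any]
    prompt: str

    @property
    def num_frames(self) -> int:
        # Temporal extent T of the clip.
        return len(self.frames)


def views_to_one_frame_clips(views: Sequence[Any], prompt: str) -> List[Clip]:
    """Wrap each reference view as its own clip with exactly one frame (T = 1).

    Because no clip contains a second frame, there is no inter-frame motion
    in the data for a model to fit; only per-view spatial appearance remains.
    """
    return [Clip(frames=[view], prompt=prompt) for view in views]


# Stand-ins for image arrays of the same object seen from different angles.
views = ["front_view", "side_view", "back_view"]
clips = views_to_one_frame_clips(views, prompt="a video of a [V] object")
```

Here three views yield three independent clips, each with `num_frames == 1`. How the resulting samples are fed to a video model, and which parameters are updated, depends on details the abstract does not specify.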
