3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
March 19, 2026
Authors: Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park
cs.AI
Abstract
Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
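The abstract describes a "1-frame optimization paradigm" that decouples spatial geometry from temporal motion by restricting updates to spatial representations while temporal components stay frozen. The paper's code is not shown here, so the following is a purely illustrative, dependency-free sketch of that idea: partitioning a video model's parameters by name so that only spatial layers are marked trainable. All layer names and the keyword convention are hypothetical assumptions, not the authors' implementation.

```python
# Illustrative sketch only: split parameters into spatial (trainable) and
# temporal/other (frozen) groups by name. In a real video diffusion model one
# would set requires_grad accordingly; names here are invented for the example.

def select_trainable(param_names, spatial_keywords=("spatial", "attn2d", "conv2d")):
    """Return (trainable, frozen) name lists; spatial params are trainable."""
    trainable, frozen = [], []
    for name in param_names:
        # A parameter counts as spatial if any keyword appears in its name.
        if any(k in name for k in spatial_keywords):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen

# Hypothetical parameter names for a toy spatio-temporal block.
params = [
    "block0.spatial_attn.q",      # spatial -> updated during 1-frame tuning
    "block0.temporal_attn.q",     # temporal -> frozen, motion prior preserved
    "block1.conv2d.weight",       # spatial -> updated
    "block1.temporal_conv.weight" # temporal -> frozen
]
train, freeze = select_trainable(params)
```

Because single frames carry no motion, freezing everything outside the spatial group is what lets the identity prior be learned without video-length training or temporal overfitting, matching the motivation stated in the abstract.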