

3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

March 19, 2026
Authors: Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park
cs.AI

Abstract

Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
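The 1-frame optimization paradigm described above can be illustrated with a minimal sketch: fine-tune only the "spatial" parameter of a toy two-stage model on a single frame while the "temporal" parameter stays frozen. The model, parameter names, and update rule here are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
# Toy illustration of spatial/temporal decoupling: only the spatial
# parameter receives gradient updates; the temporal one is frozen.
# All quantities are scalars for clarity.

def forward(x, w_spatial, w_temporal):
    # Stage 1 (spatial, per-frame) followed by stage 2 (temporal, cross-frame).
    return w_temporal * (w_spatial * x)

def finetune_spatial_only(x, y, w_spatial, w_temporal, lr=0.05, steps=200):
    for _ in range(steps):
        pred = forward(x, w_spatial, w_temporal)
        # Gradient of (pred - y)^2 w.r.t. w_spatial only;
        # w_temporal never gets an update (it is frozen).
        grad = 2.0 * (pred - y) * w_temporal * x
        w_spatial -= lr * grad
    return w_spatial, w_temporal

x, y = 1.0, 3.0  # a single reference frame and its customization target
w_s, w_t = finetune_spatial_only(x, y, w_spatial=0.5, w_temporal=1.0)
print(w_t)                      # frozen temporal weight: still 1.0
print(round(forward(x, w_s, w_t), 3))  # fits the 1-frame target: 3.0
```

The point of the toy: the subject-specific (spatial) pathway adapts to the single reference frame, while motion-related (temporal) weights are untouched, which is the mechanism the abstract credits with avoiding temporal overfitting.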
PDF · March 21, 2026