3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Abstract

Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
