Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles

Abstract

Stylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically rely on computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, and they often require dense posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we present a novel approach that achieves direct 3D stylization in less than a second from unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling from appearance shading, effectively preventing style transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. Comprehensive evaluations on both in-domain and out-of-domain datasets demonstrate that our approach produces high-quality stylized 3D content that achieves a superior blend of style and scene appearance, while also outperforming existing methods in multi-view consistency and efficiency.
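The two ideas named in the abstract, a branched design that keeps structure separate from appearance and an identity loss, can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration: the function names (`structure_branch`, `appearance_branch`, `stylize`, `identity_loss`), the luminance-as-depth rule, and the mean-colour shift are hypothetical stand-ins, not the networks or losses used by Styl3R. The sketch only shows the property the abstract describes: geometry is computed from the scene images alone, so changing the style cannot move it, and using the scene's own colours as the "style" should reproduce the input.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]
Point = Tuple[float, float, float]


def channel_means(pixels: List[Color]) -> List[float]:
    """Per-channel mean colour of a list of RGB pixels."""
    return [sum(p[k] for p in pixels) / len(pixels) for k in range(3)]


def structure_branch(content: List[Color]) -> List[Point]:
    # Hypothetical stand-in: geometry depends on the scene pixels only.
    # Here "depth" is just the pixel luminance; a real model would predict it.
    return [(float(i), 0.0, sum(c) / 3.0) for i, c in enumerate(content)]


def appearance_branch(content: List[Color], style: List[Color]) -> List[Color]:
    # Hypothetical stand-in: shift content colours so that their
    # per-channel means match those of the style pixels.
    c_mean, s_mean = channel_means(content), channel_means(style)
    return [tuple(p[k] - c_mean[k] + s_mean[k] for k in range(3)) for p in content]


def stylize(content: List[Color], style: List[Color]):
    """Run both branches; the style input never reaches the structure branch."""
    return structure_branch(content), appearance_branch(content, style)


def identity_loss(content: List[Color]) -> float:
    """Mean squared error when the content is used as its own style."""
    _, colors = stylize(content, content)
    n = 3 * len(content)
    return sum(
        (a - b) ** 2 for out, ref in zip(colors, content) for a, b in zip(out, ref)
    ) / n


if __name__ == "__main__":
    scene = [(0.25, 0.5, 0.75), (0.5, 0.5, 0.5)]
    pts_red, col_red = stylize(scene, [(1.0, 0.0, 0.0)])
    pts_blue, col_blue = stylize(scene, [(0.0, 0.0, 1.0)])
    print("same structure under both styles:", pts_red == pts_blue)
    print("identity loss:", identity_loss(scene))
```

In this toy, the separation is enforced by construction because `structure_branch` takes no style argument. How the actual model enforces the separation, and how its identity loss is formulated and weighted, is not specified in the abstract.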