NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

Abstract

Surface normal estimation is a cornerstone of a wide range of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter, which leverages the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent-space and pixel-space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showing superior performance in generating temporally consistent normal sequences with intricate details from diverse videos.
