Fast View Synthesis of Casual Videos

Abstract

Novel view synthesis from an in-the-wild video is difficult because of challenges such as scene dynamics and lack of parallax. Existing methods have shown promising results with implicit neural radiance fields, but they are slow to train and render. This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. We treat static and dynamic video content separately. Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. Our plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and to model complex, non-planar surface geometry. For efficiency, we represent the dynamic content as per-frame point clouds. Although such representations are prone to inconsistency, minor temporal inconsistencies are perceptually masked by motion. We develop a method to quickly estimate this hybrid video representation and to render novel views in real time. Our experiments show that our method renders high-quality novel views from an in-the-wild video with quality comparable to state-of-the-art methods, while training 100x faster and enabling real-time rendering.
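To make the plane-based idea concrete, the following is a minimal sketch of how one ray might be shaded against a stack of planes whose colors are stored as low-order spherical-harmonic coefficients. It is not the authors' implementation. It assumes the ray has already been intersected with the planes and the samples are sorted near to far, it uses only degree 0 and 1 real spherical harmonics (one common sign convention, without the Condon-Shortley phase), and it omits displacement maps and the dynamic point clouds entirely. The names `sh_basis_deg1`, `sh_color`, and `composite_planes` are hypothetical.

```python
import math

# Real spherical-harmonic normalization constants for degrees 0 and 1.
SH_C0 = 0.5 / math.sqrt(math.pi)            # 1 / (2 * sqrt(pi))
SH_C1 = math.sqrt(3.0 / (4.0 * math.pi))    # sqrt(3 / (4 * pi))


def sh_basis_deg1(direction):
    """Return the 4 real SH basis values (degrees 0 and 1) for a direction."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("view direction must be non-zero")
    x, y, z = x / norm, y / norm, z / norm
    # Assumed ordering: constant term, then the y, z, x linear terms.
    return [SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x]


def sh_color(coeffs, direction):
    """Evaluate view-dependent RGB from 4 per-channel SH coefficient triples."""
    basis = sh_basis_deg1(direction)
    rgb = [0.0, 0.0, 0.0]
    for weight, coeff in zip(basis, coeffs):
        for ch in range(3):
            rgb[ch] += weight * coeff[ch]
    # Clamp to the displayable range.
    return [min(1.0, max(0.0, v)) for v in rgb]


def composite_planes(samples, view_dir):
    """Front-to-back "over" compositing of per-plane samples along one ray.

    samples: list of (alpha, sh_coeffs) tuples ordered near to far.
    Returns (rgb, accumulated_alpha).
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked by nearer planes
    for alpha, coeffs in samples:
        rgb = sh_color(coeffs, view_dir)
        weight = transmittance * alpha
        for ch in range(3):
            color[ch] += weight * rgb[ch]
        transmittance *= 1.0 - alpha
    return color, 1.0 - transmittance


def constant_color(rgb):
    """Build SH coefficients for a view-independent color (helper for the demo)."""
    return [tuple(v / SH_C0 for v in rgb), (0, 0, 0), (0, 0, 0), (0, 0, 0)]


if __name__ == "__main__":
    # A half-transparent red plane in front of an opaque blue plane.
    ray_samples = [
        (0.5, constant_color((1.0, 0.0, 0.0))),
        (1.0, constant_color((0.0, 0.0, 1.0))),
    ]
    print(composite_planes(ray_samples, (0.0, 0.0, 1.0)))
```

Because every step is a fixed number of multiply-adds per plane, with no network evaluation per sample, this kind of explicit representation is cheap to render, which is in line with the efficiency argument made in the abstract.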
