RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes

Abstract

Although COLMAP has long remained the predominant method for camera parameter optimization in static scenes, it is constrained by its lengthy runtime and, when applied to dynamic scenes, by its reliance on ground truth (GT) motion masks. Many efforts have attempted to improve on it by incorporating additional priors as supervision, such as GT focal length, motion masks, 3D point clouds, camera poses, and metric depth; these, however, are typically unavailable in casually captured RGB videos. In this paper, we propose a novel method for more accurate and efficient camera parameter optimization in dynamic scenes, supervised solely by a single RGB video. Our method consists of three key components: (1) Patch-wise Tracking Filters, which establish robust and maximally sparse hinge-like relations across the RGB video; (2) Outlier-aware Joint Optimization, which optimizes camera parameters efficiently by adaptively down-weighting moving outliers, without relying on motion priors; and (3) a Two-stage Optimization Strategy, which improves stability and optimization speed through a trade-off between the Softplus limits and the convex minima in the losses. We evaluate our camera estimates both visually and numerically. To further validate their accuracy, we feed the camera estimates into a 4D reconstruction method and assess the resulting 3D scenes as well as the rendered 2D RGB and depth maps. We perform experiments on four real-world datasets (NeRF-DS, DAVIS, iPhone, and TUM-dynamics) and one synthetic dataset (MPI-Sintel), demonstrating that our method estimates camera parameters more efficiently and accurately with a single RGB video as the only supervision.
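To make the idea of adaptively down-weighting outliers more concrete, here is a minimal sketch in plain Python. It is not the method of the paper: the abstract does not specify the loss, so the Cauchy negative log-likelihood, the per-track scale parameter `raw_scale`, and the function names `cauchy_nll` and `effective_weight` are all assumptions chosen for illustration. The only element taken from the abstract is the use of Softplus, here assumed to keep a learnable scale positive.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cauchy_nll(residual: float, raw_scale: float) -> float:
    """Cauchy negative log-likelihood of a residual (assumed loss form).

    The scale is softplus(raw_scale), so an unconstrained parameter
    always maps to a positive scale.
    """
    scale = softplus(raw_scale)
    return math.log(math.pi * scale) + math.log1p((residual / scale) ** 2)


def effective_weight(residual: float, raw_scale: float) -> float:
    """Weight this loss places on a residual, relative to a squared loss.

    Equals 1 for a zero residual and decays towards 0 as the residual
    grows, so points that disagree with the camera model (for example,
    points on moving objects) contribute less to the gradient.
    """
    scale = softplus(raw_scale)
    return 1.0 / (1.0 + (residual / scale) ** 2)


if __name__ == "__main__":
    # Hypothetical reprojection residuals in pixels: two inliers, one outlier.
    for r in (0.5, 1.0, 20.0):
        print(r, effective_weight(r, raw_scale=0.0), cauchy_nll(r, raw_scale=0.0))
```

Under these assumptions, a track with a large residual receives a small weight automatically, and a larger learned scale for that track flattens its loss further, which is one way a method could suppress moving points without a motion mask.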
