RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes
September 18, 2025
Authors: Fang Li, Hao Zhang, Narendra Ahuja
cs.AI
Abstract
Although COLMAP has long been the predominant method for camera parameter optimization in static scenes, its application to dynamic scenes is constrained by its lengthy runtime and its reliance on ground-truth (GT) motion masks. Many efforts have attempted to improve it by incorporating additional priors as supervision, such as GT focal length, motion masks, 3D point clouds, camera poses, and metric depth; these priors, however, are typically unavailable in casually captured RGB videos. In this paper, we propose a novel method for more accurate and efficient camera parameter optimization in dynamic scenes, supervised solely by a single RGB video. Our method consists of three key components (illustrative sketches follow the abstract): (1) Patch-wise Tracking Filters, which establish robust and maximally sparse hinge-like relations across the RGB video; (2) Outlier-aware Joint Optimization, which optimizes camera parameters efficiently by adaptively down-weighting moving outliers, without relying on motion priors; and (3) a Two-stage Optimization Strategy, which improves stability and optimization speed by trading off between the Softplus limits and convex minima in the losses. We evaluate our camera estimates both visually and numerically. To further validate their accuracy, we feed the camera estimates into a 4D reconstruction method and assess the resulting 3D scenes as well as the rendered 2D RGB and depth maps. We perform experiments on four real-world datasets (NeRF-DS, DAVIS, iPhone, and TUM-dynamics) and one synthetic dataset (MPI-Sintel), demonstrating that our method estimates camera parameters more efficiently and accurately with a single RGB video as the only supervision.
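
To make the first component more concrete, below is a minimal sketch of what a patch-wise tracking filter could look like. The abstract only states that the filters yield robust and maximally sparse hinge-like relations; the concrete criteria here (a per-track visibility threshold plus keeping at most one track per patch cell) and all names (`filter_tracks`, `tracks`, `visible`, `patch`, `min_vis`) are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of a patch-wise track filter; the selection rules
# below are assumptions made for illustration, not the paper's method.
import numpy as np

def filter_tracks(tracks: np.ndarray, visible: np.ndarray,
                  patch: int = 32, min_vis: float = 0.8) -> np.ndarray:
    """Select a robust, spatially sparse subset of point tracks.

    tracks:  (N, T, 2) pixel trajectories across T frames.
    visible: (N, T) boolean visibility flags per track and frame.
    Returns indices of tracks that are visible often enough, keeping at
    most one (the most persistent) track per patch cell of frame 0.
    """
    vis_ratio = visible.mean(axis=1)             # fraction of frames tracked
    keep = np.flatnonzero(vis_ratio >= min_vis)  # robustness: drop flaky tracks
    keep = keep[np.argsort(-vis_ratio[keep])]    # most persistent tracks first
    cells, chosen = set(), []
    for i in keep:
        cell = tuple((tracks[i, 0] // patch).astype(int))  # cell in frame 0
        if cell not in cells:                    # sparsity: one track per cell
            cells.add(cell)
            chosen.append(i)
    return np.asarray(chosen, dtype=int)
```

Keeping a single reliable track per patch cell retains coverage of the frame while minimizing the number of constraints, which is one plausible reading of "robust and maximally sparse hinge-like relations."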
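Similarly, here is a hedged sketch of how components (2) and (3) could interact: per-track residuals are adaptively down-weighted so that moving outliers contribute little gradient without any motion mask, and the loss switches from a Softplus-shaped first stage to a convex second stage. The Gaussian weighting, the specific Softplus form, and the Huber choice for the convex stage are all assumptions for illustration; the abstract does not give the paper's exact losses.

```python
# Hypothetical sketch of an outlier-aware, two-stage robust loss; the
# weighting and per-stage penalties are illustrative assumptions.
import torch
import torch.nn.functional as F

def outlier_aware_loss(residuals: torch.Tensor, sigma: float,
                       stage: int) -> torch.Tensor:
    """Robust loss over per-track reprojection residuals of shape (N, 2).

    Stage 1: a Softplus-shaped soft hinge, nearly flat below sigma, which
    smooths the loss landscape for stability (illustrative choice).
    Stage 2: a convex Huber penalty with a well-defined minimum for fast
    final convergence (illustrative choice).
    """
    r = residuals.norm(dim=-1)  # (N,) per-track residual magnitudes
    # Adaptive down-weighting: large residuals (likely moving points) get
    # exponentially small weights, with no motion prior needed. Detached so
    # the weights act as fixed per-iteration coefficients (IRLS-style).
    w = torch.exp(-(r / sigma) ** 2).detach()
    if stage == 1:
        per_track = F.softplus(r - sigma)  # soft hinge around sigma
    else:
        per_track = F.huber_loss(r, torch.zeros_like(r),
                                 reduction="none", delta=sigma)
    return (w * per_track).mean()
```

In a joint optimization loop, `stage` would plausibly flip from 1 to 2 once the camera parameters stabilize; the weights `w` then act as a soft, supervision-free separation between static points and moving outliers.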