arXiv: 2605.28477v1

SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

May 27, 2026
Authors: Changxuan Li, Nadine Berner, Nassir Navab, Federico Tombari, Stefano Gasperini
cs.CV

Abstract

Self-supervised depth estimation from monocular sequences relies on the joint learning of a depth network and a pose network. Despite abundant research on improving the depth network, efforts on the pose network remain limited. In this context, even when depth is estimated only up to scale, we highlight the importance of aligning the scene scales estimated by the pose and depth networks. We introduce SA4Depth, an approach that improves this alignment and boosts the depth predictions while keeping the inference time unchanged. During training, our method uses the estimated depth to reproject learnable visual features across consecutive frames and refines the pose estimates by reducing feature alignment residuals. As a result, the scene scales estimated by the separate depth and pose networks are aligned, and the scale consistency of the predictions improves across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments outdoors and indoors on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth.
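To make the scale-alignment point concrete, the following is a minimal sketch of depth-based reprojection with a pinhole camera. It is not the SA4Depth implementation: the function name `reproject`, the intrinsics, the pixel, and the pose values are all hypothetical, and the learnable features and the pose refinement step are omitted. It only illustrates that the warp is unchanged when depth and pose translation share the same scale factor, and shifts when they do not.

```python
def reproject(u, v, depth, K, R, t):
    """Warp pixel (u, v) with the given depth into a second view.

    K = (fx, fy, cx, cy) are pinhole intrinsics; R (3x3 nested lists) and
    t (length-3) are the relative rotation and translation between views.
    All names and values here are illustrative assumptions.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in the source camera frame.
    X = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform into the target camera frame.
    Y = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Project back onto the target image plane.
    return (fx * Y[0] / Y[2] + cx, fy * Y[1] / Y[2] + cy)


K = (720.0, 720.0, 620.0, 190.0)          # hypothetical intrinsics
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]                     # no rotation, for simplicity
t = (0.1, 0.0, -1.0)                      # hypothetical relative translation
u, v, depth = 700.0, 250.0, 10.0          # hypothetical pixel and depth
s = 0.5                                   # arbitrary global scale factor

# Reference warp at the original scale.
ref = reproject(u, v, depth, K, R, t)
# Depth and translation scaled together: the warp is unchanged.
aligned = reproject(u, v, s * depth, K, R, tuple(s * c for c in t))
# Only depth scaled: the pose and depth scales disagree and the warp shifts.
mismatched = reproject(u, v, s * depth, K, R, t)

print("reference :", ref)
print("aligned   :", aligned)
print("mismatched:", mismatched)
```

Because a self-supervised reprojection loss is computed at the warped pixel locations, a scale mismatch between the two networks moves those locations, which is one way to see why agreement between the two scales matters even when absolute scale is unobservable.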
May 28, 2026