arXiv: 2605.28477v1

SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

May 27, 2026
Authors: Changxuan Li, Nadine Berner, Nassir Navab, Federico Tombari, Stefano Gasperini
cs.CV

Abstract

Self-supervised depth estimation from monocular sequences relies on jointly learning a depth network and a pose network. Despite abundant research on improving the depth network, efforts on the pose network remain limited. In this context, even when depth is estimated only up to scale, we highlight the importance of aligning the scene scales estimated by the pose and depth networks. We introduce SA4Depth, an approach that improves this alignment and boosts depth predictions while keeping inference time unchanged. Our method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refines the pose estimates by reducing feature alignment residuals. With our method, the scene scales estimated by the separate depth and pose networks are aligned, and the scale consistency of predictions improves across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive outdoor and indoor experiments on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth.
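
To illustrate why the scales of the depth and pose estimates must agree, the following is a minimal NumPy sketch of depth-based reprojection between two frames. It is not the authors' implementation: the function name `reproject`, the toy intrinsics, and the toy pose are all assumptions made for illustration, and it warps plain pixel coordinates rather than learned features and performs no pose refinement. It only shows that scaling the depth by a factor leaves the reprojection unchanged when the translation is scaled by the same factor, and changes it otherwise.

```python
import numpy as np


def reproject(depth, K, R, t):
    """Map every pixel of a source frame into a target frame (hypothetical helper).

    depth: (H, W) depth per pixel; K: (3, 3) camera intrinsics;
    R: (3, 3) rotation and t: (3,) translation from source to target camera.
    Returns an (H, W, 2) array of pixel coordinates in the target frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, shape (H, W, 3).
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T      # back-project to rays at unit depth
    points = rays * depth[..., None]     # 3D points in the source camera
    points_t = points @ R.T + t          # same points in the target camera
    proj = points_t @ K.T                # project with the intrinsics
    return proj[..., :2] / proj[..., 2:3]


# Toy setup (all values are illustrative assumptions).
K = np.array([[100.0, 0.0, 8.0],
              [0.0, 100.0, 6.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.05])
depth = np.full((12, 16), 5.0)
s = 2.0  # arbitrary global scale factor

base = reproject(depth, K, R, t)
aligned = reproject(s * depth, K, R, s * t)   # depth and pose share the scale
mismatched = reproject(s * depth, K, R, t)    # only the depth is rescaled

print(np.allclose(base, aligned))      # scale-aligned pair reproduces the warp
print(np.allclose(base, mismatched))   # mismatched scales shift the warp
```

In this sketch, any residual computed on the warped coordinates would be unaffected by a shared scale but would grow when the two scales disagree, which is the kind of mismatch the abstract's alignment argument is concerned with.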