Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
October 10, 2025
Authors: Wenyao Zhang, Hongsi Liu, Bohan Li, Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao, Xinqiang Yu, Wenjun Zeng, Xin Jin
cs.AI
Abstract
Current self-supervised monocular depth estimation (MDE) approaches encounter
performance limitations due to insufficient semantic-spatial knowledge
extraction. To address this challenge, we propose Hybrid-depth, a novel
framework that systematically integrates foundation models (e.g., CLIP and
DINO) to extract visual priors and acquire sufficient contextual information
for MDE. Our approach introduces a coarse-to-fine progressive learning
framework: 1) Firstly, we aggregate multi-grained features from CLIP (global
semantics) and DINO (local spatial details) under contrastive language
guidance. A proxy task that compares close and distant image patches is designed
to enforce depth-aware feature alignment using text prompts; 2) Next, building on
the coarse features, we integrate camera pose information and pixel-wise
language alignment to refine depth predictions. This module seamlessly
integrates with existing self-supervised MDE pipelines (e.g., Monodepth2,
ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth
estimation. By aggregating CLIP's semantic context and DINO's spatial details
through language guidance, our method effectively addresses feature granularity
mismatches. Extensive experiments on the KITTI benchmark demonstrate that our
method significantly outperforms SOTA methods across all metrics and also
benefits downstream tasks such as BEV perception. Code is available at
https://github.com/Zhangwenyao1/Hybrid-depth.
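
To make the coarse stage more concrete, the following is a minimal PyTorch sketch (not the released code) of how per-patch CLIP semantics and DINO spatial features could be fused and aligned to depth-related text prompts via the close/distant proxy task described in the abstract. The module and function names, prompt ordering, feature dimensions, and the source of the close-patch labels (close_mask) are all illustrative assumptions; the actual Hybrid-depth encoder additionally refines predictions with camera pose information and pixel-wise language alignment.

# Illustrative sketch, assuming pre-extracted per-patch CLIP and DINO features
# and CLIP-encoded text prompts such as ["a close object", "a distant object"].
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridCoarseStage(nn.Module):
    def __init__(self, clip_dim=512, dino_dim=384, embed_dim=256):
        super().__init__()
        # Project both feature streams into a shared space before fusion.
        self.clip_proj = nn.Linear(clip_dim, embed_dim)
        self.dino_proj = nn.Linear(dino_dim, embed_dim)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)
        self.text_proj = nn.Linear(clip_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, clip_patch_feats, dino_patch_feats, text_feats):
        # clip_patch_feats: (B, N, clip_dim)  per-patch CLIP image features (global semantics)
        # dino_patch_feats: (B, N, dino_dim)  per-patch DINO features (local spatial details)
        # text_feats:       (2, clip_dim)     CLIP text features for the close/distant prompts
        fused = self.fuse(torch.cat(
            [self.clip_proj(clip_patch_feats), self.dino_proj(dino_patch_feats)],
            dim=-1))
        fused = F.normalize(fused, dim=-1)
        text = F.normalize(self.text_proj(text_feats), dim=-1)
        # Similarity of every patch to the "close" / "distant" prompts.
        logits = self.logit_scale.exp() * fused @ text.t()   # (B, N, 2)
        return fused, logits


def patch_contrast_loss(logits, close_mask):
    # Proxy task: patches labelled as close should match the "close" prompt
    # (index 0), the remaining patches the "distant" prompt (index 1).
    # close_mask: (B, N) boolean; how it is obtained is an assumption here.
    target = (~close_mask).long()
    return F.cross_entropy(logits.flatten(0, 1), target.flatten())

In the full framework, the fused depth-aware features produced by such a coarse stage would then feed the fine stage and serve as a plug-and-play depth encoder inside a Monodepth2- or ManyDepth-style self-supervised pipeline.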