
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast

June 16, 2025
Authors: Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren
cs.AI

Abstract

This work presents a generalizable framework for transferring relative depth to metric depth. Current monocular depth estimation methods are mainly divided into monocular metric depth estimation (MMDE) and monocular relative depth estimation (MRDE). MMDE methods estimate depth at metric scale but are often limited to a specific domain, while MRDE methods generalize well across domains but produce depth with uncertain scale, which hinders downstream applications. To this end, we aim to build a framework that resolves scale uncertainty and transfers relative depth to metric depth. Previous methods use language as input and estimate two factors for rescaling. Our approach, TR2M, takes both a text description and an image as input and estimates two rescale maps that transfer relative depth to metric depth at the pixel level. Features from the two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning, which uses the depth distribution as guidance to encourage the model to learn intrinsic knowledge aligned with the scale distribution. TR2M requires only a small number of trainable parameters for training on datasets from various domains, and experiments not only demonstrate TR2M's strong performance on seen datasets but also reveal superior zero-shot capability on five unseen datasets. Our results show the great potential of pixel-wise transfer of relative depth to metric depth with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
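
To make the pixel-level rescaling idea concrete, below is a minimal PyTorch sketch of how per-pixel scale and shift maps, predicted from image features fused with text features via cross-modality attention, could be applied to a relative depth map. The module layout, feature dimensions, and the affine form (metric = scale · relative + shift) are illustrative assumptions for this sketch and are not taken from the TR2M implementation; refer to the linked repository for the actual method.

```python
# Hedged sketch: pixel-wise rescaling of relative depth using two predicted maps.
# All names, dimensions, and the affine rescaling form are assumptions for illustration.
import torch
import torch.nn as nn


class CrossModalRescaler(nn.Module):
    """Fuses image and text tokens with cross-attention, then predicts
    per-pixel scale and shift maps that convert relative depth to metric depth."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Image tokens attend to text tokens (cross-modality attention).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Two lightweight heads: one for the scale map, one for the shift map.
        self.scale_head = nn.Linear(dim, 1)
        self.shift_head = nn.Linear(dim, 1)

    def forward(self, image_tokens, text_tokens, relative_depth):
        # image_tokens: (B, H*W, dim), text_tokens: (B, T, dim)
        # relative_depth: (B, 1, H_img, W_img)
        fused, _ = self.cross_attn(query=image_tokens, key=text_tokens, value=text_tokens)
        b, n, _ = fused.shape
        h = w = int(n ** 0.5)  # assume a square token grid for simplicity
        scale = self.scale_head(fused).transpose(1, 2).reshape(b, 1, h, w)
        shift = self.shift_head(fused).transpose(1, 2).reshape(b, 1, h, w)
        # Upsample both maps to the depth resolution and apply them pixel-wise.
        size = relative_depth.shape[-2:]
        scale = nn.functional.interpolate(scale, size=size, mode="bilinear", align_corners=False)
        shift = nn.functional.interpolate(shift, size=size, mode="bilinear", align_corners=False)
        return scale.exp() * relative_depth + shift  # exp keeps the scale positive


# Usage with random tensors standing in for encoder outputs and an MRDE prediction.
model = CrossModalRescaler()
img_feat = torch.randn(2, 24 * 24, 256)   # e.g. ViT patch features
txt_feat = torch.randn(2, 16, 256)        # e.g. text-encoder token features
rel_depth = torch.rand(2, 1, 384, 384)    # relative depth from an off-the-shelf MRDE model
metric_depth = model(img_feat, txt_feat, rel_depth)
print(metric_depth.shape)  # torch.Size([2, 1, 384, 384])
```

The key contrast with earlier language-guided rescaling is visible in the output shapes: instead of two global scalars per image, the heads here produce dense maps, so each pixel receives its own scale and shift.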