TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast

June 16, 2025
Authors: Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren
cs.AI

Abstract

This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into monocular metric depth estimation (MMDE) and monocular relative depth estimation (MRDE). MMDEs estimate depth at metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but their uncertain scale hinders downstream applications. To this end, we aim to build a framework that resolves scale uncertainty and transfers relative depth to metric depth. Previous methods used language as input and estimated two global factors for rescaling. Our approach, TR2M, utilizes both a text description and the image as inputs and estimates two rescale maps to transfer relative depth to metric depth at the pixel level. Features from the two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning, which uses the depth distribution as guidance to encourage the model to learn intrinsic knowledge aligned with the scale distribution. TR2M exploits only a small number of trainable parameters for training on datasets from various domains. Experiments not only demonstrate TR2M's strong performance on seen datasets but also reveal superior zero-shot capabilities on five unseen datasets. We show the great potential of pixel-wise transferring relative depth to metric depth with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
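
To make the pixel-wise rescaling concrete, below is a minimal sketch assuming the two predicted rescale maps act as an element-wise scale and shift applied to the relative depth; the function name, tensor shapes, and example values are illustrative only, and the exact formulation is defined in the TR2M code release.

```python
import torch

def apply_rescale_maps(relative_depth: torch.Tensor,
                       scale_map: torch.Tensor,
                       shift_map: torch.Tensor) -> torch.Tensor:
    """Pixel-wise rescaling of relative depth into metric depth.

    Assumes the two maps predicted by the network act as an element-wise
    scale and shift; all tensors share the shape (B, 1, H, W).
    """
    return scale_map * relative_depth + shift_map

# Toy usage with random tensors standing in for model outputs.
rel = torch.rand(1, 1, 480, 640)           # relative depth from an MRDE model
scale = torch.full((1, 1, 480, 640), 5.0)  # hypothetical predicted scale map
shift = torch.zeros(1, 1, 480, 640)        # hypothetical predicted shift map
metric = apply_rescale_maps(rel, scale, shift)
print(metric.shape)  # torch.Size([1, 1, 480, 640])
```

Because the scale and shift are maps rather than two global scalars, the rescaling can vary across the image, which is what distinguishes this pixel-level transfer from earlier global rescaling approaches described in the abstract.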