TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast

Abstract

This work presents a generalizable framework for transferring relative depth to metric depth. Current monocular depth estimation methods fall mainly into two groups: metric depth estimation (MMDE) and relative depth estimation (MRDE). MMDE methods estimate depth at metric scale but are often limited to a specific domain. MRDE methods generalize well across domains, but their scale is uncertain, which hinders downstream applications. To address this, we aim to build a framework that resolves scale uncertainty and transfers relative depth to metric depth. Previous methods used language as input and estimated two factors for rescaling. Our approach, TR2M, takes both a text description and an image as input and estimates two rescale maps that transfer relative depth to metric depth at the pixel level. Features from the two modalities are fused with a cross-modality attention module to better capture scale information. We design a strategy to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning, which uses the depth distribution as guidance to encourage the model to learn intrinsic knowledge aligned with the scale distribution. TR2M uses only a small number of trainable parameters to train on datasets from various domains. Experiments demonstrate strong performance on seen datasets and superior zero-shot capability on five unseen datasets. These results show the considerable potential of pixel-wise transfer from relative depth to metric depth with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
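The difference between rescaling with two global factors and rescaling with two pixel-level maps can be illustrated with a minimal sketch. It assumes the rescaling is affine (a scale term multiplied by the relative depth, plus a shift term). The abstract does not state this form, and the function names, variable names, and toy values below are hypothetical rather than taken from the TR2M code.

```python
# Illustrative sketch only: the affine form (scale * relative + shift) and all
# names below are assumptions, not taken from the TR2M repository.

def rescale_global(relative, scale, shift):
    """Apply one scalar (scale, shift) pair to every pixel."""
    return [[scale * d + shift for d in row] for row in relative]


def rescale_pixelwise(relative, scale_map, shift_map):
    """Apply a separate (scale, shift) pair at each pixel via two maps."""
    if not (len(relative) == len(scale_map) == len(shift_map)):
        raise ValueError("maps must have the same height as the depth image")
    out = []
    for d_row, a_row, b_row in zip(relative, scale_map, shift_map):
        if not (len(d_row) == len(a_row) == len(b_row)):
            raise ValueError("maps must have the same width as the depth image")
        out.append([a * d + b for d, a, b in zip(d_row, a_row, b_row)])
    return out


# Toy 2x2 relative depth image with made-up values.
relative = [[0.0, 0.5], [1.0, 0.25]]
# Hypothetical rescale maps; in a real system a network would predict these.
scale_map = [[2.0, 2.0], [4.0, 8.0]]
shift_map = [[1.0, 0.0], [0.0, 1.0]]

global_metric = rescale_global(relative, scale=2.0, shift=1.0)
pixel_metric = rescale_pixelwise(relative, scale_map, shift_map)
```

Under this assumption, global rescaling is the special case in which both maps are constant, so pixel-level maps can correct regions where a single scale and shift pair fits poorly.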

