TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast

Abstract

This work presents a generalizable framework for transferring relative depth to metric depth. Current monocular depth estimation methods fall mainly into two groups: monocular metric depth estimation (MMDE) and monocular relative depth estimation (MRDE). MMDE methods estimate depth in metric scale but are often limited to a specific domain. MRDE methods generalize well across domains, but their uncertain scale hinders downstream applications. To this end, we aim to build a framework that resolves scale uncertainty and transfers relative depth to metric depth. Previous methods used language as input and estimated two factors for rescaling. Our approach, TR2M, takes both a text description and an image as inputs and estimates two rescale maps to transfer relative depth to metric depth at the pixel level. Features from the two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning, which uses the depth distribution as guidance to push the model to learn intrinsic knowledge aligned with the scale distribution. TR2M uses only a small number of trainable parameters to train on datasets from various domains. Experiments demonstrate strong performance on seen datasets and superior zero-shot capabilities on five unseen datasets. We show the great potential of pixel-wise transfer from relative depth to metric depth with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
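The difference between estimating two global rescaling factors and estimating two pixel-level rescale maps can be illustrated with a small sketch. The abstract does not state the exact form of the rescaling, so the affine form used here (metric = scale * relative + shift) is an assumption. The function names `rescale_global` and `rescale_pixelwise` and all numeric values are hypothetical and chosen only for illustration. This is not the TR2M implementation.

```python
import numpy as np

def rescale_global(relative, scale, shift):
    """Apply one scalar scale and one scalar shift to the whole map.

    Assumed affine form; stands in for the 'two factors' setting.
    """
    return scale * relative + shift

def rescale_pixelwise(relative, scale_map, shift_map):
    """Apply a separate scale and shift at every pixel.

    Assumed affine form; stands in for the 'two rescale maps' setting.
    """
    if not (relative.shape == scale_map.shape == shift_map.shape):
        raise ValueError("relative depth and both rescale maps must share a shape")
    return scale_map * relative + shift_map

# Toy 2x2 relative depth map with values in [0, 1] (made-up numbers).
relative = np.array([[0.0, 0.5],
                     [0.5, 1.0]])

# Global rescaling: every pixel uses the same two numbers.
global_metric = rescale_global(relative, scale=4.0, shift=1.0)

# Pixel-wise rescaling: each pixel has its own scale and shift.
scale_map = np.array([[4.0, 4.0],
                      [2.0, 2.0]])
shift_map = np.array([[1.0, 1.0],
                      [0.5, 0.5]])
pixel_metric = rescale_pixelwise(relative, scale_map, shift_map)

print(global_metric)
print(pixel_metric)
```

In this toy case the two pixels with relative depth 0.5 map to the same metric value under global rescaling but to different values under pixel-wise rescaling. That extra freedom is what per-pixel maps add over two scalar factors.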
