On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation

Abstract

Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although these approaches yield impressive results, the impact of the language prior, particularly in terms of generalization and robustness, remains unexplored. In this paper, we address this gap by quantifying the impact of this prior and introducing methods to benchmark its effectiveness across various settings. We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors, and evaluate their downstream impact on depth estimation. Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions and, counterintuitively, fare worse with low-level descriptions. Despite leveraging additional data, these methods are not robust to directed adversarial attacks, and their performance declines as distribution shift increases. Finally, to provide a foundation for future research, we identify points of failure and offer insights to better understand these shortcomings. With an increasing number of methods using language for depth estimation, our findings highlight the opportunities and pitfalls that require careful consideration for effective deployment in real-world settings.
