BRIDGE - Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation

Abstract

Monocular Depth Estimation (MDE) is a foundational task in computer vision. Existing methods are limited by the scarcity and quality of training data, which hinders their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images from diverse source depth maps, each intrinsically paired with its ground truth depth. We then train our depth estimation model on this dataset, using a hybrid supervision strategy that integrates teacher pseudo-labels with ground truth depth for comprehensive and robust training. This data generation and training paradigm enables BRIDGE to achieve breakthroughs in scale and domain diversity, consistently outperforming existing state-of-the-art approaches both quantitatively and in capturing complex scene detail, thereby fostering general and robust depth features. Code and models are available at https://dingning-liu.github.io/bridge.github.io/.
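
The abstract does not specify how teacher pseudo-labels and ground truth depth are combined. As a rough illustration only, one way such a hybrid supervision signal could be formed is a per-pixel blend of two L1 errors. The function name `hybrid_depth_loss`, the `gt_valid` mask, the `gt_weight` parameter, and the blending rule are all assumptions made for this sketch, not the BRIDGE implementation.

```python
def hybrid_depth_loss(pred, gt, pseudo, gt_valid, gt_weight=0.5):
    """Mean per-pixel L1 loss mixing ground truth and teacher pseudo-labels.

    Hypothetical rule (an assumption, not taken from the paper):
    where gt_valid is True, blend the ground-truth error and the
    pseudo-label error using gt_weight; elsewhere, use only the
    pseudo-label error.
    """
    if not (len(pred) == len(gt) == len(pseudo) == len(gt_valid)):
        raise ValueError("all inputs must have the same number of pixels")
    if len(pred) == 0:
        raise ValueError("inputs must not be empty")
    if not 0.0 <= gt_weight <= 1.0:
        raise ValueError("gt_weight must lie in [0, 1]")

    total = 0.0
    for p, g, t, valid in zip(pred, gt, pseudo, gt_valid):
        pseudo_err = abs(p - t)  # error against the teacher pseudo-label
        if valid:
            gt_err = abs(p - g)  # error against the ground truth depth
            total += gt_weight * gt_err + (1.0 - gt_weight) * pseudo_err
        else:
            total += pseudo_err
    return total / len(pred)
```

Depth maps are passed as flattened lists of floats to keep the sketch dependency-free. A real training loop would more likely use tensor operations and might use a scale-invariant or gradient-based loss instead of plain L1.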
