BRIDGE - Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation

Abstract

Monocular depth estimation (MDE) is a foundational task in computer vision. Traditional methods are limited by the scarcity and quality of training data, which hinders their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images from diverse source depth maps, each intrinsically paired with its ground-truth depth. We then train our depth estimation model on this dataset, using a hybrid supervision strategy that integrates teacher pseudo-labels with ground-truth depth for comprehensive and robust training. This data generation and training paradigm enables BRIDGE to achieve breakthroughs in scale and domain diversity, consistently outperforming existing state-of-the-art approaches both quantitatively and in capturing detail in complex scenes, thereby fostering general and robust depth features. Code and models are available at https://dingning-liu.github.io/bridge.github.io/.
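
The abstract does not say how teacher pseudo-labels and ground-truth depth are combined during training. As a minimal sketch of what a hybrid supervision objective could look like, the example below takes a weighted sum of two L1 terms: one against the teacher's pseudo-labels over all pixels, and one against ground-truth depth over pixels marked valid. The function name `hybrid_depth_loss`, the `gt_weight` parameter, the validity mask, and the choice of an L1 error are all illustrative assumptions, not details taken from BRIDGE.

```python
import numpy as np


def hybrid_depth_loss(pred, gt_depth, pseudo_depth, gt_valid, gt_weight=0.5):
    """Weighted sum of L1 errors against ground truth and teacher pseudo-labels.

    Hypothetical helper: the weighting scheme and the L1 error are
    assumptions for illustration, not the loss used by BRIDGE.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt_depth = np.asarray(gt_depth, dtype=np.float64)
    pseudo_depth = np.asarray(pseudo_depth, dtype=np.float64)
    gt_valid = np.asarray(gt_valid, dtype=bool)

    # L1 error against the teacher's pseudo-labels, averaged over all pixels.
    pseudo_term = np.abs(pred - pseudo_depth).mean()

    # L1 error against ground truth, averaged over valid pixels only.
    if gt_valid.any():
        gt_term = np.abs(pred - gt_depth)[gt_valid].mean()
    else:
        gt_term = 0.0

    return gt_weight * gt_term + (1.0 - gt_weight) * pseudo_term


# Toy 2x2 depth maps: the prediction differs from each target at one pixel.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[1.0, 2.0], [3.0, 6.0]])
pseudo = np.array([[2.0, 2.0], [3.0, 4.0]])
all_valid = np.ones((2, 2), dtype=bool)

loss_all = hybrid_depth_loss(pred, gt, pseudo, all_valid)
```

In this sketch, setting `gt_weight` closer to 1 makes training rely more on the ground-truth depth paired with each generated image, while values closer to 0 lean on the teacher model.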
