Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator

Abstract

Monocular depth estimation (MDE) aims to predict scene depth from a single RGB image and plays a crucial role in 3D scene understanding. Recent advances in zero-shot MDE leverage normalized depth representations and distillation-based learning to improve generalization across diverse scenes. However, current distillation pipelines rely on global depth normalization, which can amplify noise in pseudo-labels and reduce distillation effectiveness. In this paper, we systematically analyze how different depth normalization strategies affect pseudo-label distillation. Based on our findings, we propose Cross-Context Distillation, which integrates global and local depth cues to improve pseudo-label quality. We also introduce a multi-teacher distillation framework that leverages the complementary strengths of different depth estimation models, leading to more robust and accurate depth predictions. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.
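To make the contrast between global and local normalization concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the normalization formula, so the median and mean-absolute-deviation normalization, the helper names, and the toy 4x4 depth map are all assumptions chosen for illustration. The sketch shows one way a single noisy pseudo-label value can distort the globally normalized targets of a clean region, while normalizing only within a local crop leaves that region untouched.

```python
from statistics import median

def normalize(values):
    """Shift by the median and scale by the mean absolute deviation.

    Assumption: a common affine-invariant normalization, used here only
    for illustration; the abstract does not state the exact formula.
    """
    m = median(values)
    scale = sum(abs(v - m) for v in values) / len(values)
    if scale == 0:
        return [0.0 for _ in values]
    return [(v - m) / scale for v in values]

def flatten(grid):
    return [v for row in grid for v in row]

def crop(grid, top, left, size):
    return [row[left:left + size] for row in grid[top:top + size]]

def global_then_crop(depth, top, left, size):
    """Normalize with whole-image statistics, then read out the crop."""
    width = len(depth[0])
    flat = normalize(flatten(depth))
    grid = [flat[r * width:(r + 1) * width] for r in range(len(depth))]
    return flatten(crop(grid, top, left, size))

def local_crop(depth, top, left, size):
    """Normalize using statistics of the crop only."""
    return normalize(flatten(crop(depth, top, left, size)))

# Hypothetical 4x4 pseudo-label depth map.
clean = [[1.0, 2.0, 3.0, 4.0],
         [2.0, 3.0, 4.0, 5.0],
         [3.0, 4.0, 5.0, 6.0],
         [4.0, 5.0, 6.0, 7.0]]

# Same map with one noisy value outside the top-left 2x2 crop.
noisy = [row[:] for row in clean]
noisy[3][3] = 70.0

g_clean = global_then_crop(clean, 0, 0, 2)
g_noisy = global_then_crop(noisy, 0, 0, 2)
l_clean = local_crop(clean, 0, 0, 2)
l_noisy = local_crop(noisy, 0, 0, 2)

# How much the normalized targets inside the clean crop move.
global_shift = max(abs(a - b) for a, b in zip(g_clean, g_noisy))
local_shift = max(abs(a - b) for a, b in zip(l_clean, l_noisy))
print("global shift:", global_shift)
print("local shift:", local_shift)
```

In this toy case the outlier lies outside the crop, so the locally normalized targets are unchanged while the globally normalized ones shift noticeably. Combining both views, as Cross-Context Distillation is described as doing, would keep global scene context while limiting this kind of cross-region contamination.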
