Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator

Abstract

Monocular depth estimation (MDE) aims to predict scene depth from a single RGB image and plays a crucial role in 3D scene understanding. Recent advances in zero-shot MDE leverage normalized depth representations and distillation-based learning to improve generalization across diverse scenes. However, current depth normalization methods for distillation rely on global normalization, which can amplify noisy pseudo-labels and reduce the effectiveness of distillation. In this paper, we systematically analyze the impact of different depth normalization strategies on pseudo-label distillation. Based on our findings, we propose Cross-Context Distillation, which integrates global and local depth cues to enhance pseudo-label quality. Additionally, we introduce a multi-teacher distillation framework that leverages the complementary strengths of different depth estimation models, leading to more robust and accurate depth predictions. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.
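The contrast between global and local normalization can be illustrated with a small NumPy sketch. The abstract does not give the normalization formula or the loss, so everything here is an assumption made for illustration: the median-shift and mean-absolute-deviation scaling, the L1 distance, and the function names `normalize_depth`, `global_distill_loss`, `local_distill_loss`, and `combined_loss` are all hypothetical. It is not the paper's Cross-Context Distillation, only a toy showing how statistics computed over the whole image let one noisy pseudo-label region affect every pixel, while statistics computed inside a clean crop do not.

```python
import numpy as np


def normalize_depth(depth, eps=1e-6):
    """Shift by the median and scale by the mean absolute deviation.

    Assumed normalization, chosen for illustration only.
    """
    shift = np.median(depth)
    scale = np.mean(np.abs(depth - shift))
    return (depth - shift) / (scale + eps)


def global_distill_loss(student, teacher):
    """L1 distance after normalizing each map over the whole image."""
    diff = normalize_depth(student) - normalize_depth(teacher)
    return float(np.mean(np.abs(diff)))


def local_distill_loss(student, teacher, top, left, size):
    """L1 distance after normalizing each map inside one square crop."""
    s = student[top:top + size, left:left + size]
    t = teacher[top:top + size, left:left + size]
    diff = normalize_depth(s) - normalize_depth(t)
    return float(np.mean(np.abs(diff)))


def combined_loss(student, teacher, crops, weight=0.5):
    """Hypothetical blend of the global term and the mean local term."""
    local = np.mean([local_distill_loss(student, teacher, *c) for c in crops])
    return weight * global_distill_loss(student, teacher) + (1.0 - weight) * float(local)


# Toy data: the student matches the clean depth; the teacher pseudo-label
# carries a large error in one 8x8 corner.
student = np.linspace(1.0, 10.0, 64 * 64).reshape(64, 64)
teacher = student.copy()
teacher[:8, :8] += 100.0

print("global loss:", global_distill_loss(student, teacher))
print("local loss (clean crop):", local_distill_loss(student, teacher, 32, 32, 16))
```

In this toy setup the global loss is nonzero even though the student is correct everywhere outside the corrupted corner, because the corrupted region changes the global scale. The loss on the clean crop is unaffected by it.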

