MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation
July 26, 2023
Authors: Reiner Birkl, Diana Wofk, Matthias Müller
cs.AI
Abstract
We release MiDaS v3.1 for monocular depth estimation, offering a variety of
new models based on different encoder backbones. This release is motivated by
the success of transformers in computer vision, with a large variety of
pretrained vision transformers now available. We explore how using the most
promising vision transformers as image encoders impacts depth estimation
quality and runtime of the MiDaS architecture. Our investigation also includes
recent convolutional approaches that achieve comparable quality to vision
transformers in image classification tasks. While the previous release MiDaS
v3.0 solely leverages the vanilla vision transformer ViT, MiDaS v3.1 offers
additional models based on BEiT, Swin, SwinV2, Next-ViT and LeViT. These models
offer different performance-runtime tradeoffs. The best model improves the
depth estimation quality by 28% while efficient models enable downstream tasks
requiring high frame rates. We also describe the general process for
integrating new backbones. A video summarizing the work can be found at
https://youtu.be/UjaeNNFf9sE and the code is available at
https://github.com/isl-org/MiDaS.
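For orientation, the released models are exposed through torch.hub in the linked repository. The snippet below is a minimal sketch of loading one model and producing a relative depth map for a single image; the model identifiers, transform names, and selection logic are taken from the repository README and should be verified there. The v3.1 backbones discussed in the abstract (e.g. DPT_BEiT_L_512 or DPT_SwinV2_L_384) are loaded the same way, each with its matching input transform.

```python
# Minimal sketch: load a MiDaS model via torch.hub and estimate relative depth
# for one image. Check https://github.com/isl-org/MiDaS for the current list of
# model identifiers and transforms; v3.1 backbones (e.g. "DPT_BEiT_L_512") ship
# with their own transforms.
import cv2
import torch

model_type = "DPT_Hybrid"  # or a v3.1 backbone, e.g. "DPT_BEiT_L_512"

midas = torch.hub.load("intel-isl/MiDaS", model_type)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas.to(device).eval()

# Each backbone expects a specific input resolution; the matching transform
# handles resizing and normalization.
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
if model_type in ("DPT_Large", "DPT_Hybrid"):
    transform = transforms.dpt_transform
else:
    transform = transforms.small_transform

img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
input_batch = transform(img).to(device)

with torch.no_grad():
    prediction = midas(input_batch)
    # Upsample the prediction back to the original image resolution.
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

depth = prediction.cpu().numpy()  # relative inverse depth, valid up to scale and shift
```

Note that the output is relative inverse depth: it orders scene points by depth but carries no metric scale, so downstream tasks that need absolute distances must recover scale and shift separately.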