MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation
July 26, 2023
Authors: Reiner Birkl, Diana Wofk, Matthias Müller
cs.AI
Abstract
We release MiDaS v3.1 for monocular depth estimation, offering a variety of
new models based on different encoder backbones. This release is motivated by
the success of transformers in computer vision, with a large variety of
pretrained vision transformers now available. We explore how using the most
promising vision transformers as image encoders impacts depth estimation
quality and runtime of the MiDaS architecture. Our investigation also includes
recent convolutional approaches that achieve comparable quality to vision
transformers in image classification tasks. While the previous release MiDaS
v3.0 solely leverages the vanilla vision transformer ViT, MiDaS v3.1 offers
additional models based on BEiT, Swin, SwinV2, Next-ViT and LeViT. These models
offer different performance-runtime tradeoffs. The best model improves the
depth estimation quality by 28% while efficient models enable downstream tasks
requiring high frame rates. We also describe the general process for
integrating new backbones. A video summarizing the work can be found at
https://youtu.be/UjaeNNFf9sE and the code is available at
https://github.com/isl-org/MiDaS.
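For orientation, the released models are exposed through torch.hub in the linked repository. The snippet below is a minimal sketch of loading one model and producing a relative depth map for a single image; the model identifiers, transform names, and selection logic are taken from the repository README and should be verified there. The v3.1 backbones discussed in the abstract (e.g. DPT_BEiT_L_512 or DPT_SwinV2_L_384) are loaded the same way, each with its matching input transform.

```python
# Minimal sketch: load a MiDaS model via torch.hub and estimate relative depth
# for one image. Check https://github.com/isl-org/MiDaS for the current list of
# model identifiers and transforms; v3.1 backbones (e.g. "DPT_BEiT_L_512") ship
# with their own transforms.
import cv2
import torch

model_type = "DPT_Hybrid"  # or a v3.1 backbone, e.g. "DPT_BEiT_L_512"

midas = torch.hub.load("intel-isl/MiDaS", model_type)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas.to(device).eval()

# Each backbone expects a specific input resolution; the matching transform
# handles resizing and normalization.
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
if model_type in ("DPT_Large", "DPT_Hybrid"):
    transform = transforms.dpt_transform
else:
    transform = transforms.small_transform

img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
input_batch = transform(img).to(device)

with torch.no_grad():
    prediction = midas(input_batch)
    # Upsample the prediction back to the original image resolution.
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

depth = prediction.cpu().numpy()  # relative inverse depth, valid up to scale and shift
```

Note that the output is relative inverse depth: it orders scene points by depth but carries no metric scale, so downstream tasks that need absolute distances must recover scale and shift separately.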