MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation
July 26, 2023
Authors: Reiner Birkl, Diana Wofk, Matthias Müller
cs.AI
Abstract
We release MiDaS v3.1 for monocular depth estimation, offering a variety of new models based on different encoder backbones. This release is motivated by the success of transformers in computer vision, with a large variety of pretrained vision transformers now available. We explore how using the most promising vision transformers as image encoders impacts depth estimation quality and runtime of the MiDaS architecture. Our investigation also includes recent convolutional approaches that achieve comparable quality to vision transformers in image classification tasks. While the previous release MiDaS v3.0 solely leverages the vanilla vision transformer ViT, MiDaS v3.1 offers additional models based on BEiT, Swin, SwinV2, Next-ViT and LeViT. These models offer different performance-runtime tradeoffs. The best model improves the depth estimation quality by 28% while efficient models enable downstream tasks requiring high frame rates. We also describe the general process for integrating new backbones. A video summarizing the work can be found at https://youtu.be/UjaeNNFf9sE and the code is available at https://github.com/isl-org/MiDaS.
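As a quick illustration of how the released models can be used, below is a minimal sketch that loads one of the MiDaS v3.1 backbones through PyTorch Hub and predicts a relative inverse-depth map for a single image. The specific hub entry point "DPT_BEiT_L_512" and the transform attribute "beit512_transform" are assumptions about the v3.1 interface and should be checked against hubconf.py in the MiDaS repository, which lists all available models and their matching input transforms.

# Minimal sketch: run a MiDaS v3.1 model on one image via torch.hub.
# Model name and transform attribute are assumptions; see hubconf.py in
# https://github.com/isl-org/MiDaS for the full list of entry points.
import cv2
import torch

model_type = "DPT_BEiT_L_512"  # assumed v3.1 hub entry point (BEiT-L backbone)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas = torch.hub.load("intel-isl/MiDaS", model_type).to(device).eval()

# Input transform matching the chosen backbone (assumed attribute name).
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.beit512_transform

# Load an image (path is illustrative) and convert BGR -> RGB.
img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
input_batch = transform(img).to(device)

with torch.no_grad():
    prediction = midas(input_batch)
    # Resize the prediction back to the original image resolution.
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

depth = prediction.cpu().numpy()  # relative (inverse) depth, unitless

Note that the output is relative inverse depth, consistent with the paper's focus on relative rather than metric depth estimation; swapping model_type for one of the smaller backbones (e.g. a LeViT-based entry point) trades accuracy for the higher frame rates mentioned in the abstract.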