CostFormer:Cost Transformer for Cost Aggregation in Multi-view Stereo
May 17, 2023
作者: Weitao Chen, Hongbin Xu, Zhipeng Zhou, Yang Liu, Baigui Sun, Wenxiong Kang, Xuansong Xie
cs.AI
Abstract
The core of Multi-view Stereo (MVS) is the matching process between reference
and source pixels. Cost aggregation plays a significant role in this process,
and previous methods handle it mainly with CNNs. This inherits a natural
limitation of CNNs: limited local receptive fields make it hard to
discriminate repetitive or incorrect matches. To address this issue, we aim to
introduce the Transformer into cost aggregation. However, the Transformer's
quadratically growing computational complexity raises another problem,
potentially causing memory overflow and inference latency. In this paper, we
overcome these limits with an efficient Transformer-based cost aggregation
network named CostFormer. The Residual Depth-Aware Cost Transformer (RDACT) is
proposed to aggregate long-range features on the cost volume via self-attention
mechanisms along the depth and spatial dimensions. Furthermore, the Residual
Regression Transformer (RRT) is proposed to enhance spatial attention. The
proposed method is a universal plug-in that improves learning-based MVS methods.
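The abstract does not specify RDACT's architecture, but the core idea it names, self-attention along the depth dimension of a cost volume with a residual connection, can be sketched roughly as follows. This is a minimal single-head NumPy illustration; the shapes, the weight matrices `Wq`/`Wk`/`Wv`, and the per-pixel attention formulation are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depth_self_attention(cost, Wq, Wk, Wv):
    """Illustrative self-attention along the depth axis of a cost volume.

    cost: (D, H, W, C) cost volume. Attention is computed independently at
    each spatial location (h, w), letting every depth hypothesis attend to
    all D hypotheses at that pixel; a residual connection adds the result
    back onto the input (hypothetical setup, not the paper's RDACT).
    """
    D, H, W, C = cost.shape
    # Treat the D depth hypotheses at each pixel as a token sequence.
    tokens = cost.reshape(D, H * W, C).transpose(1, 0, 2)  # (HW, D, C)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))  # (HW, D, D)
    out = attn @ v                                          # (HW, D, C)
    # Residual connection back onto the original cost volume.
    return cost + out.transpose(1, 0, 2).reshape(D, H, W, C)

rng = np.random.default_rng(0)
D, H, W, C = 4, 8, 8, 16  # toy sizes for illustration
cost = rng.standard_normal((D, H, W, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
out = depth_self_attention(cost, Wq, Wk, Wv)
print(out.shape)  # (4, 8, 8, 16)
```

Attending only along the D depth hypotheses per pixel keeps the attention matrix at D×D rather than (D·H·W)×(D·H·W), which is one plausible way to avoid the quadratic blow-up the abstract warns about; spatial attention would be handled by a separate, similarly restricted pass.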