CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo

Abstract

The core of Multi-view Stereo (MVS) is the matching process between reference and source pixels. Cost aggregation plays a significant role in this process, and previous methods have mainly handled it with CNNs. Such methods may inherit the inherent limitation of CNNs, which fail to discriminate repetitive or incorrect matches because of their limited local receptive fields. To address this issue, we introduce Transformers into cost aggregation. However, the computational complexity of Transformers grows quadratically with the number of tokens, which can lead to memory overflow and inference latency. In this paper, we overcome these limitations with an efficient Transformer-based cost aggregation network, CostFormer. The Residual Depth-Aware Cost Transformer (RDACT) is proposed to aggregate long-range features on the cost volume via self-attention mechanisms along the depth and spatial dimensions. Furthermore, the Residual Regression Transformer (RRT) is proposed to enhance spatial attention. The proposed method is a universal plug-in that improves learning-based MVS methods.
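To make the quadratic-growth concern concrete, the following is a minimal sketch that counts attention-matrix entries when every voxel of a cost volume is treated as a token. It is not CostFormer's implementation: the function names, the volume size, and the non-overlapping window partition are all hypothetical, chosen only to show how restricting attention to local blocks shrinks the cost.

```python
def full_attention_entries(depth, height, width):
    """Entries in one attention matrix when every voxel of a
    depth x height x width cost volume attends to every other voxel."""
    tokens = depth * height * width
    return tokens * tokens


def windowed_attention_entries(depth, height, width, window):
    """Entries when attention is limited to non-overlapping blocks of
    shape window = (wd, wh, ww). Illustrative assumption only; the
    abstract does not specify how CostFormer restricts attention."""
    wd, wh, ww = window
    if depth % wd or height % wh or width % ww:
        raise ValueError("window must evenly divide the volume")
    num_windows = (depth // wd) * (height // wh) * (width // ww)
    tokens_per_window = wd * wh * ww
    return num_windows * tokens_per_window ** 2


if __name__ == "__main__":
    d, h, w = 8, 64, 80  # hypothetical coarse cost-volume size
    full = full_attention_entries(d, h, w)
    local = windowed_attention_entries(d, h, w, (4, 8, 8))
    print(f"full: {full}, windowed: {local}, ratio: {full // local}")
```

Under these assumptions, doubling every dimension of the volume multiplies the full-attention count by 64, while the windowed count grows only linearly with the number of voxels.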
