CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo

Abstract

The core of Multi-view Stereo (MVS) is the matching process between reference and source pixels. Cost aggregation plays a significant role in this process, yet previous methods have mainly handled it with CNNs. Such methods may inherit a natural limitation of CNNs: because of their limited local receptive fields, they can fail to discriminate repetitive or incorrect matches. To address this issue, we aim to incorporate Transformers into cost aggregation. However, the computational complexity of a Transformer grows quadratically, which can cause memory overflow and inference latency. In this paper, we overcome these limits with an efficient Transformer-based cost aggregation network, namely CostFormer. The Residual Depth-Aware Cost Transformer (RDACT) is proposed to aggregate long-range features on the cost volume via self-attention mechanisms along the depth and spatial dimensions. Furthermore, the Residual Regression Transformer (RRT) is proposed to enhance spatial attention. The proposed method is a universal plug-in for improving learning-based MVS methods.
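The quadratic cost mentioned in the abstract, and the idea of self-attention along the depth dimension, can be illustrated with a small sketch. This is not the CostFormer implementation: the function names are hypothetical, the non-overlapping window scheme is only one common way to bound attention cost and is an assumption here, and the depth attention uses the raw cost as query, key, and value with no learned projections.

```python
import math


def attention_pairs_full(depth, height, width):
    """Query-key pairs when every voxel of a D x H x W cost volume attends to every voxel."""
    n = depth * height * width
    return n * n


def attention_pairs_windowed(depth, height, width, win_d, win_h, win_w):
    """Query-key pairs when attention is limited to non-overlapping windows.

    Assumption for illustration: each dimension is divisible by its window size.
    """
    tokens_per_window = win_d * win_h * win_w
    num_windows = (depth // win_d) * (height // win_h) * (width // win_w)
    return num_windows * tokens_per_window ** 2


def depth_self_attention(costs):
    """Toy scalar self-attention along the depth axis for a single pixel.

    costs: one matching cost per depth hypothesis.
    Each output is a softmax-weighted average of all costs, so every depth
    hypothesis can draw on every other one (a long-range interaction).
    """
    out = []
    for q in costs:
        scores = [q * k for k in costs]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, costs)) / total)
    return out


if __name__ == "__main__":
    print(attention_pairs_full(4, 4, 4))
    print(attention_pairs_windowed(4, 4, 4, 2, 2, 2))
    print(depth_self_attention([0.0, 2.0]))
```

Under these assumptions, the number of pairs for full attention grows with the square of the voxel count, while the windowed count grows only linearly in the number of windows. That gap is the kind of saving an efficient design has to find.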
