CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo

Abstract

The core of multi-view stereo (MVS) is the matching process between reference and source pixels. Cost aggregation plays a significant role in this process, and previous methods have focused on handling it with CNNs. These methods may inherit a natural limitation of CNNs: because of their limited local receptive fields, they fail to discriminate repetitive or incorrect matches. To address this issue, we introduce Transformers into cost aggregation. However, the quadratically growing computational complexity of Transformers creates another problem, which can result in memory overflow and inference latency. In this paper, we overcome these limits with an efficient Transformer-based cost aggregation network, CostFormer. We propose the Residual Depth-Aware Cost Transformer (RDACT) to aggregate long-range features on the cost volume via self-attention mechanisms along the depth and spatial dimensions. Furthermore, we propose the Residual Regression Transformer (RRT) to enhance spatial attention. The proposed method is a universal plug-in for improving learning-based MVS methods.
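
To make the quadratic-complexity concern concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It assumes that every voxel of a cost volume with shape depth x height x width is treated as one token and that plain global self-attention is applied, so the score matrix has one entry per token pair. The function names, the example volume sizes, and the 4-byte score size are illustrative assumptions, not values taken from the paper.

```python
def attention_matrix_entries(depth: int, height: int, width: int) -> int:
    """Pairwise attention scores when every cost-volume voxel attends to every other.

    Assumption: one token per voxel and plain global self-attention.
    """
    tokens = depth * height * width
    return tokens * tokens


def attention_matrix_gib(depth: int, height: int, width: int,
                         bytes_per_score: int = 4) -> float:
    """Memory for a single attention score matrix, in GiB (float32 scores assumed)."""
    return attention_matrix_entries(depth, height, width) * bytes_per_score / 2**30


if __name__ == "__main__":
    # Hypothetical cost-volume sizes; doubling height and width gives 4x the
    # tokens and therefore 16x the attention scores.
    for scale in (1, 2, 4):
        d, h, w = 8, 16 * scale, 20 * scale
        gib = attention_matrix_gib(d, h, w)
        print(f"D={d} H={h} W={w}: tokens={d * h * w}, score matrix ~ {gib:.3f} GiB")
```

Under these assumptions, memory grows with the square of the token count, which is the kind of growth the abstract identifies as the cause of memory overflow and inference latency.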

