VMamba: Visual State Space Model

Abstract

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are the two most popular foundation models for visual representation learning. CNNs scale remarkably well, with linear complexity with respect to image resolution, whereas ViTs surpass them in fitting capability despite their quadratic complexity. A closer inspection reveals that ViTs achieve superior visual modeling performance by incorporating global receptive fields and dynamic weights. This observation motivates us to propose a novel architecture that inherits these components while improving computational efficiency. To this end, we draw inspiration from the recently introduced state space model and propose the Visual State Space Model (VMamba), which achieves linear complexity without sacrificing global receptive fields. To address the direction-sensitivity issue that arises, we introduce the Cross-Scan Module (CSM), which traverses the spatial domain and converts any non-causal visual image into ordered patch sequences. Extensive experimental results substantiate that VMamba not only demonstrates promising capabilities across various visual perception tasks, but also exhibits more pronounced advantages over established benchmarks as the image resolution increases. Source code is available at https://github.com/MzeroMiko/VMamba.
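
To make the idea of converting a 2D image into ordered patch sequences concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the traversal orders, so the four orderings used here (row-major, column-major, and the reverse of each) are an assumption, and the names `cross_scan` and `cross_merge` are hypothetical helpers chosen for illustration. The sequence model that would process each ordering is omitted entirely.

```python
import numpy as np


def cross_scan(patches: np.ndarray) -> np.ndarray:
    """Flatten an (H, W, C) patch grid into four (H*W, C) sequences.

    Assumed orderings (not specified in the abstract):
      0: row-major, starting at the top-left patch
      1: column-major, starting at the top-left patch
      2: reverse of ordering 0
      3: reverse of ordering 1
    """
    h, w, c = patches.shape
    row_major = patches.reshape(h * w, c)
    col_major = patches.transpose(1, 0, 2).reshape(h * w, c)
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])


def cross_merge(sequences: np.ndarray, h: int, w: int) -> np.ndarray:
    """Undo each ordering and sum the four results into an (H, W, C) grid."""
    c = sequences.shape[-1]
    # Re-reverse the backward sequences so they align with the forward ones.
    row_major = sequences[0] + sequences[2][::-1]
    col_major = sequences[1] + sequences[3][::-1]
    # Column-major sequences are laid out as (W, H, C); transpose back.
    col_as_grid = col_major.reshape(w, h, c).transpose(1, 0, 2)
    return row_major.reshape(h, w, c) + col_as_grid


if __name__ == "__main__":
    grid = np.arange(6).reshape(2, 3, 1)  # a toy 2x3 grid of 1-channel patches
    sequences = cross_scan(grid)
    print(sequences[..., 0])
    print(cross_merge(sequences, 2, 3)[..., 0])
```

In this sketch each patch appears in a different position in each of the four sequences, so a causal sequence model applied to each one would give every patch context from a different direction. Merging by summation is likewise an illustrative choice.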
