SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series
March 22, 2024
Authors: Badri N. Patro, Vijay S. Agneeswaran
cs.AI
Abstract
Transformers have widely adopted attention networks for sequence mixing and
MLPs for channel mixing, playing a pivotal role in achieving breakthroughs
across domains. However, recent literature highlights issues with attention
networks, including low inductive bias and quadratic complexity concerning
input sequence length. State Space Models (SSMs) like S4 and others (Hippo,
Global Convolutions, liquid S4, LRU, Mega, and Mamba), have emerged to address
the above issues to help handle longer sequence lengths. Mamba, while being the
state-of-the-art SSM, has a stability issue when scaled to large networks for
computer vision datasets. We propose SiMBA, a new architecture that introduces
Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations
and uses the Mamba block for sequence modeling. Extensive performance studies
across image and time-series benchmarks demonstrate that SiMBA outperforms
existing SSMs, bridging the performance gap with state-of-the-art transformers.
Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet,
on transfer learning benchmarks such as Stanford Cars and Flowers, on task
learning benchmarks, and on seven time series benchmark datasets. The project
page is available at https://github.com/badripatro/Simba.
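
To make the architecture described in the abstract concrete, below is a minimal sketch of one SiMBA-style block: a Mamba block for sequence (token) mixing paired with frequency-domain channel mixing. This is an illustration under stated assumptions, not the authors' implementation: the class names `SiMBABlock` and `EinFFTChannelMixer`, the pre-norm residual layout, and the use of the `mamba_ssm` package are choices made here, and the paper's EinFFT, which applies learned transformations to the spectral representation of the channels, is reduced to a single learned complex weighting.

```python
# Minimal sketch of a SiMBA-style block (assumptions: PyTorch and the
# mamba-ssm package are available; names and layout are illustrative).
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # selective state space sequence mixer


class EinFFTChannelMixer(nn.Module):
    """Simplified frequency-domain channel mixing: real FFT over the channel
    axis, a learned complex spectral weighting, then inverse FFT.
    (The paper's EinFFT uses richer learned spectral transforms.)"""

    def __init__(self, dim: int):
        super().__init__()
        freq_bins = dim // 2 + 1  # rfft output size along the channel axis
        self.weight = nn.Parameter(
            0.02 * torch.randn(freq_bins, dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        x_f = torch.fft.rfft(x, dim=-1)      # channels -> frequency domain
        x_f = x_f * self.weight              # learned spectral modulation
        return torch.fft.irfft(x_f, n=x.shape[-1], dim=-1)


class SiMBABlock(nn.Module):
    """Sequence mixing with a Mamba block, channel mixing with the
    simplified EinFFT above, each in a pre-norm residual branch."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.seq_mixer = Mamba(d_model=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mixer = EinFFTChannelMixer(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.seq_mixer(self.norm1(x))      # token/sequence mixing
        x = x + self.channel_mixer(self.norm2(x))  # channel mixing
        return x
```

Inputs are token sequences of shape (batch, seq_len, dim), e.g. flattened image patches for the ImageNet experiments or lookback windows for the time series benchmarks; a full model would stack several such blocks between an embedding layer and a task head.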