SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series

Abstract

Transformers have widely adopted attention networks for sequence mixing and MLPs for channel mixing, a design that has driven breakthroughs across many domains. However, recent literature highlights two problems with attention networks: low inductive bias and complexity that is quadratic in the input sequence length. State Space Models (SSMs) such as S4 and its successors (HiPPO, Global Convolutions, Liquid S4, LRU, Mega, and Mamba) have emerged to address these issues and to handle longer sequences. Mamba, although it is the state-of-the-art SSM, suffers from stability issues when scaled to large networks on computer vision datasets. We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling through specific eigenvalue computations and uses the Mamba block for sequence modeling. Extensive performance studies on image and time-series benchmarks show that SiMBA outperforms existing SSMs and narrows the performance gap with state-of-the-art transformers. Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet, on transfer learning benchmarks such as Stanford Cars and Flowers, on task learning benchmarks, and on seven time-series benchmark datasets. The project page is available at https://github.com/badripatro/Simba.
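The abstract describes EinFFT channel mixing only at a high level. As a rough illustration of what frequency-domain channel mixing can look like, the following is a minimal NumPy sketch, not the authors' implementation (see the repository for that). It assumes that the channel axis is split into blocks, that a 1D FFT is taken along the channel axis within each block, that mixing is done by complex per-block weights applied with `einsum` (standing in for "Einstein" matrix multiplication), and that a ReLU is applied separately to the real and imaginary parts. The function name `einfft_channel_mix` and the weight shapes are hypothetical.

```python
import numpy as np


def einfft_channel_mix(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Hypothetical EinFFT-style channel mixing (illustrative sketch only).

    x  : real array of shape (batch, tokens, channels)
    w1 : complex weights of shape (num_blocks, block_dim, block_dim)
    w2 : complex weights with the same shape as w1
    Assumes channels == num_blocks * block_dim.
    """
    batch, tokens, channels = x.shape
    num_blocks, block_dim, _ = w1.shape
    if num_blocks * block_dim != channels:
        raise ValueError("channels must equal num_blocks * block_dim")

    # Split the channel axis into blocks: (batch, tokens, num_blocks, block_dim).
    xb = x.reshape(batch, tokens, num_blocks, block_dim)

    # Assumption: 1D FFT along the within-block channel axis (not the token axis).
    xf = np.fft.fft(xb, axis=-1)

    # Einstein-summation matrix multiply: one complex linear map per channel block.
    h = np.einsum("bnkc,kcd->bnkd", xf, w1)

    # Assumption: ReLU applied separately to the real and imaginary parts.
    h = np.maximum(h.real, 0.0) + 1j * np.maximum(h.imag, 0.0)

    # Second per-block complex linear map.
    h = np.einsum("bnkc,kcd->bnkd", h, w2)

    # Inverse FFT back to the channel domain; keep only the real part.
    y = np.fft.ifft(h, axis=-1).real
    return y.reshape(batch, tokens, channels)


# Usage: 2 sequences, 5 tokens, 8 channels split into 2 blocks of 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8))
w1 = rng.standard_normal((2, 4, 4)) + 1j * rng.standard_normal((2, 4, 4))
w2 = rng.standard_normal((2, 4, 4)) + 1j * rng.standard_normal((2, 4, 4))
y = einfft_channel_mix(x, w1, w2)
print(y.shape)  # (2, 5, 8)
```

Every operation in this sketch acts within a single token, so it mixes channels only; in the architecture described above, mixing across tokens is left to the Mamba block.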
