SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series

Abstract

Transformers have widely adopted attention networks for sequence mixing and MLPs (multilayer perceptrons) for channel mixing, which has played a pivotal role in achieving breakthroughs across domains. However, recent literature highlights issues with attention networks, including low inductive bias and quadratic complexity with respect to input sequence length. State Space Models (SSMs) such as S4 and others (Hippo, Global Convolutions, liquid S4, LRU, Mega, and Mamba) have emerged to address these issues and to handle longer sequence lengths. Mamba, while being the state-of-the-art SSM, has a stability issue when scaled to large networks for computer vision datasets. We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations and uses the Mamba block for sequence modeling. Extensive performance studies across image and time-series benchmarks demonstrate that SiMBA outperforms existing SSMs, bridging the performance gap with state-of-the-art transformers. Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet, on transfer learning benchmarks such as Stanford Car and Flower, on task learning benchmarks, and on seven time-series benchmark datasets. The project page is available at https://github.com/badripatro/Simba.
