Sparsified State-Space Models are Efficient Highway Networks
May 27, 2025
Authors: Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin
cs.AI
Abstract
State-space models (SSMs) offer a promising architecture for sequence
modeling, providing an alternative to Transformers by replacing expensive
self-attention with linear recurrences. In this paper, we propose a simple yet
effective trick to enhance SSMs within given computational budgets by
sparsifying them. Our intuition is that tokens in SSMs are highly redundant due
to gradual recurrent updates, and dense recurrence operations block the
delivery of past information. In particular, we observe that upper layers of
SSMs tend to be more redundant as they encode global information, while lower
layers encode local information. Motivated by this, we introduce Simba, a
hierarchical sparsification method for SSMs based on token pruning. Simba
sparsifies upper layers more than lower layers, encouraging the upper layers to
behave like highways. To achieve this, we propose a novel token pruning
criterion for SSMs, measuring the global impact of tokens on the final output
by accumulating local recurrences. We demonstrate that Simba outperforms the
baseline model, Mamba, with the same FLOPs in various natural language tasks.
Moreover, we illustrate the effect of highways, showing that Simba not only
enhances efficiency but also improves the information flow across long
sequences. Code is available at https://github.com/woominsong/Simba.
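To make the pruning idea concrete, the sketch below illustrates hierarchical token pruning for a diagonal linear-recurrence SSM: a keep ratio that shrinks with layer depth (so upper layers are sparsified more), and a token score that accumulates local recurrent decay gates to estimate how much of each token's contribution survives to the final output. The function names, shapes, and exact scoring rule here are illustrative assumptions, not the authors' implementation; see the released code at https://github.com/woominsong/Simba for the actual method.

```python
# Minimal sketch (assumed details, not the official Simba implementation) of
# hierarchical token pruning for a diagonal linear-recurrence SSM.
import numpy as np

def global_influence_scores(decay):
    """Score each token by the accumulated product of per-step decay gates
    from the step after the token up to the end of the sequence, i.e. an
    estimate of how much of its contribution reaches the final output.

    decay: (T, d_state) array of per-step recurrent decay factors in (0, 1).
    Returns a (T,) score; tokens whose influence decays quickly score low.
    """
    log_decay = np.log(decay)                           # (T, d_state)
    suffix = np.cumsum(log_decay[::-1], axis=0)[::-1]   # sum over steps t..T-1
    suffix = suffix - log_decay                         # exclude the token's own step
    return np.exp(suffix).mean(axis=-1)                 # average over state dims

def hierarchical_keep_ratios(num_layers, base_keep=1.0, top_keep=0.5):
    """Keep ratio decreases with depth: lower layers keep all tokens,
    upper layers are pruned more aggressively (illustrative schedule)."""
    return np.linspace(base_keep, top_keep, num_layers)

def prune_tokens(x, decay, keep_ratio):
    """Keep the top-k tokens by global influence, preserving their order."""
    T = x.shape[0]
    k = max(1, int(round(T * keep_ratio)))
    scores = global_influence_scores(decay)
    keep = np.sort(np.argsort(scores)[-k:])
    return x[keep], keep

# Toy usage: a 3-layer stack over a sequence of 8 tokens with 4 state dims.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
for layer, r in enumerate(hierarchical_keep_ratios(3)):
    T = x.shape[0]
    decay = rng.uniform(0.5, 0.99, size=(T, 4))  # stand-in for learned gates
    x, kept = prune_tokens(x, decay, r)
    print(f"layer {layer}: keep_ratio={r:.2f}, kept {len(kept)}/{T} tokens")
```

The schedule above keeps every token in the lowest layer and drops progressively more in upper layers, which is the "highway" behavior the abstract describes; in practice the decay gates would come from the SSM's learned, input-dependent state-transition parameters rather than random values.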