Sparsified State-Space Models are Efficient Highway Networks

Abstract

State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within a given computational budget by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and that dense recurrence operations block the delivery of past information. In particular, we observe that the upper layers of SSMs tend to be more redundant because they encode global information, while the lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs that measures the global impact of each token on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPs on various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences. Code is available at https://github.com/woominsong/Simba.
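The idea of scoring tokens by accumulating local recurrences can be illustrated with a toy example. The sketch below is not the criterion or schedule used by Simba, which the abstract does not spell out; it assumes a scalar linear recurrence h_t = a_t * h_{t-1} + b_t * x_t, and the function names, the scoring rule, and the linear keep-ratio schedule are all hypothetical choices made for illustration.

```python
from typing import List


def influence_scores(a: List[float], b: List[float], x: List[float]) -> List[float]:
    """Score each token by its contribution to the final state.

    Assumed toy recurrence (not taken from the paper):
        h_t = a[t] * h_{t-1} + b[t] * x[t],  with h_{-1} = 0
    Unrolling gives h_T = sum_t (prod_{s>t} a[s]) * b[t] * x[t], so the
    magnitude of each summand serves as token t's global score.
    """
    n = len(x)
    scores = [0.0] * n
    decay = 1.0  # running product of a[s] over all s > t
    for t in range(n - 1, -1, -1):
        scores[t] = abs(decay * b[t] * x[t])
        decay *= a[t]
    return scores


def prune_tokens(scores: List[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in original order."""
    n_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:n_keep])


def keep_schedule(num_layers: int, min_keep: float) -> List[float]:
    """Hypothetical linear schedule: keep every token at the lowest layer
    and only a `min_keep` fraction at the top layer."""
    if num_layers == 1:
        return [1.0]
    step = (1.0 - min_keep) / (num_layers - 1)
    return [1.0 - step * layer for layer in range(num_layers)]
```

With a = [0.5, 0.5, 0.5, 0.5], b = [1, 1, 1, 1] and x = [8, 1, 1, 1], the first and last tokens receive the highest scores: the first because its input is large enough to survive three decay steps, the last because it has not decayed at all. A schedule such as keep_schedule(3, 0.5) then prunes more aggressively in upper layers, mirroring the hierarchical sparsification the abstract describes.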
