MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

Abstract

Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, has time and space complexity that is quadratic in the input size, which limits their scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architectural backbones for multi-dimensional data, such as images and multivariate time series, existing models are either data independent or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models with efficient hardware-aware implementations, have shown promising potential for long-sequence modeling. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called the Selective Token and Channel Mixer. MambaMixer connects selective mixers using a weighted averaging mechanism, allowing layers to have direct access to early features. As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance in various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation tasks, ViM2 achieves performance competitive with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods at a significantly lower computational cost. These results show that while Transformers, cross-channel attention, and MLPs are sufficient for good performance in time series forecasting, none of them is necessary.
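To make the dual selection idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a toy scalar recurrence whose gate depends on the current input (a stand-in for a selective SSM), applies it first along the token axis and then along the channel axis, and combines the block input with both mixer outputs through a normalized weighted average. All function names, the gate parameters `w` and `b`, the unidirectional scan order, and the weights are hypothetical choices made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def selective_scan(seq, w=1.0, b=0.0):
    """Toy data-dependent recurrence over a 1-D sequence of floats.

    h_t = a_t * h_{t-1} + (1 - a_t) * x_t, with a_t = sigmoid(w * x_t + b).
    The gate a_t depends on the input, which is the "selective" idea.
    Assumption: real selective SSMs use learned, higher-dimensional
    parameters; this scalar form is only an illustration.
    """
    h, out = 0.0, []
    for x in seq:
        a = sigmoid(w * x + b)
        h = a * h + (1.0 - a) * x
        out.append(h)
    return out


def token_mixer(x):
    """Scan along the token axis, independently for each channel."""
    n_tokens, n_channels = len(x), len(x[0])
    cols = [selective_scan([x[t][c] for t in range(n_tokens)])
            for c in range(n_channels)]
    return [[cols[c][t] for c in range(n_channels)] for t in range(n_tokens)]


def channel_mixer(x):
    """Scan along the channel axis, independently for each token."""
    return [selective_scan(row) for row in x]


def weighted_average(features, weights):
    """Combine feature maps of equal shape using normalized scalar weights."""
    total = sum(weights)
    n_tokens, n_channels = len(features[0]), len(features[0][0])
    return [[sum(w * f[t][c] for w, f in zip(weights, features)) / total
             for c in range(n_channels)] for t in range(n_tokens)]


def mixer_block(x, weights=(1.0, 1.0, 1.0)):
    """One toy block: token mixing, then channel mixing, then a weighted
    average over the input and both mixer outputs, so the result keeps
    direct access to the earlier features."""
    tok = token_mixer(x)
    ch = channel_mixer(tok)
    return weighted_average([x, tok, ch], list(weights))


# 4 tokens, 3 channels (hypothetical input values)
x = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0], [0.0, -0.5, 1.0], [1.5, 1.0, 0.5]]
y = mixer_block(x)
print(len(y), len(y[0]))  # prints "4 3"
```

Each scan visits its sequence once, so this toy block costs time proportional to tokens times channels, in contrast to the quadratic cost in sequence length of full attention. Setting the weights to `(1.0, 0.0, 0.0)` returns the input unchanged, which shows how the averaging step can pass early features straight through.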
