MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

March 29, 2024
Authors: Ali Behrouz, Michele Santacatterina, Ramin Zabih
cs.AI

Abstract

Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space complexity in the input size, limiting their scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architecture backbones for multi-dimensional data, such as images and multivariate time series, existing models are either data-independent or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models with efficient hardware-aware implementations, have shown promising potential for long-sequence modeling. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called the Selective Token and Channel Mixer. MambaMixer connects the selective mixers using a weighted averaging mechanism, allowing layers to have direct access to early features. As a proof of concept, we design the Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance in various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation tasks, ViM2 achieves competitive performance with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods while incurring significantly lower computational cost. These results show that while Transformers, cross-channel attention, and MLPs are sufficient for good performance in time series forecasting, none of them is necessary.
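
To make the block structure described in the abstract concrete, below is a minimal PyTorch-style sketch of one MambaMixer block, based only on the description above: a selective token mixer, a selective channel mixer applied to the transposed sequence, and a learned weighted average over earlier block outputs. The class names (SelectiveMixerStub, MambaMixerBlock), the gated cumulative-average stand-in for the selective SSM scan, and the exact placement of the weighted averaging are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class SelectiveMixerStub(nn.Module):
    """Placeholder for a selective SSM (e.g. a Mamba block) scanning along the second axis."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim). A data-dependent gated cumulative average along the
        # sequence stands in for a real selective SSM scan; it is not an SSM itself.
        gate = torch.sigmoid(self.gate(x))              # input-dependent "selection"
        num = torch.cumsum(gate * self.proj(x), dim=1)  # running gated sum over positions
        den = torch.cumsum(gate, dim=1) + 1e-6
        return num / den


class MambaMixerBlock(nn.Module):
    def __init__(self, d_model: int, n_tokens: int, n_prev: int):
        super().__init__()
        self.token_norm = nn.LayerNorm(d_model)
        self.channel_norm = nn.LayerNorm(n_tokens)
        self.token_mixer = SelectiveMixerStub(d_model)     # mixes information across tokens
        self.channel_mixer = SelectiveMixerStub(n_tokens)  # mixes information across channels
        # Learnable weights for averaging this block's output with earlier block outputs.
        self.skip_weights = nn.Parameter(torch.zeros(n_prev + 1))

    def forward(self, x: torch.Tensor, earlier: list) -> torch.Tensor:
        # x: (batch, n_tokens, d_model); earlier: list of n_prev tensors of the same shape.
        h = x + self.token_mixer(self.token_norm(x))
        # Transpose so the channel axis is treated as the sequence for the channel mixer.
        h_t = h.transpose(1, 2)
        h_t = h_t + self.channel_mixer(self.channel_norm(h_t))
        h = h_t.transpose(1, 2)
        # Weighted average over earlier block outputs and the current output,
        # giving the layer direct access to early features.
        feats = torch.stack(earlier + [h], dim=0)           # (n_prev + 1, batch, tokens, dim)
        weights = torch.softmax(self.skip_weights, dim=0)
        return torch.einsum("k,kbnd->bnd", weights, feats)

In a stack of such blocks, each block would receive the list of all preceding outputs (earlier = [out_1, ..., out_k]) so it can re-weight early features directly; in the paper's actual ViM2 and TSM2 models, the mixers are real selective SSM (Mamba) blocks rather than the gated cumulative-average stub used here.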
