Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
February 6, 2024
Authors: Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos
cs.AI
Abstract
State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed
as alternatives to Transformer networks in language modeling, by incorporating
gating, convolutions, and input-dependent token selection to mitigate the
quadratic cost of multi-head attention. Although SSMs exhibit competitive
performance, their in-context learning (ICL) capabilities, a remarkable
emergent property of modern language models that enables task execution without
parameter optimization, remain underexplored compared to Transformers. In this
study, we evaluate the ICL performance of SSMs, focusing on Mamba, against
Transformer models across various tasks. Our results show that SSMs perform
comparably to Transformers in standard regression ICL tasks, while
outperforming them in tasks like sparse parity learning. However, SSMs fall
short in tasks involving non-standard retrieval functionality. To address these
limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with
attention blocks, surpassing individual models in tasks where they struggle
independently. Our findings suggest that hybrid architectures offer promising
avenues for enhancing ICL in language models.
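For concreteness, here is a minimal sketch of how a standard in-context regression episode can be constructed, in the common Garg et al. (2022)-style setup of interleaving (x_i, y_i) pairs into one sequence so the model must predict labels from context alone, without parameter updates. The function name, dimensions, and token layout below are illustrative assumptions, not necessarily the paper's exact protocol.

```python
# Minimal sketch of an in-context linear regression episode (illustrative
# assumptions: dimensions, noise model, and token layout are not the paper's exact setup).
import torch

def sample_linear_regression_prompt(n_points=40, dim=20, noise_std=0.0):
    """Sample one in-context regression episode: (x_1, y_1, ..., x_k, y_k)."""
    w = torch.randn(dim)                                   # task vector, fixed within the episode
    xs = torch.randn(n_points, dim)                        # in-context inputs
    ys = xs @ w + noise_std * torch.randn(n_points)        # targets y_i = <w, x_i> (+ optional noise)

    # Interleave (x_i, y_i) pairs into one token sequence: x tokens carry the
    # input, y tokens carry the label in their first coordinate.
    y_tokens = torch.zeros(n_points, dim)
    y_tokens[:, 0] = ys
    prompt = torch.stack([xs, y_tokens], dim=1).reshape(2 * n_points, dim)
    return prompt, ys                                      # model predicts y_i from the prefix ending at x_i

# Example: a batch of episodes with shape (batch, seq_len, dim)
batch = torch.stack([sample_linear_regression_prompt()[0] for _ in range(8)])
print(batch.shape)  # torch.Size([8, 80, 20])
```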
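The hybrid idea of combining Mamba-style sequence mixing with attention can be pictured as a residual block that applies an SSM-style mixer followed by causal self-attention. The PyTorch sketch below is only an illustration under that assumption: `HybridBlock` and `CausalConvMixer` are hypothetical names, the pre-norm wiring is a generic choice rather than the paper's exact MambaFormer layout, and the stand-in mixer exists only so the example runs without a Mamba implementation (a drop-in SSM module such as `mamba_ssm.Mamba` could replace it).

```python
# Sketch of a hybrid block: SSM-style sequence mixer + causal self-attention,
# each with a pre-norm residual connection. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridBlock(nn.Module):
    """Pre-norm residual block: sequence mixer, then causal self-attention."""
    def __init__(self, d_model, n_heads, mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mixer = mixer                               # e.g. a Mamba block (assumed drop-in)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                                # x: (batch, seq, d_model)
        x = x + self.mixer(self.norm1(x))                # SSM / Mamba sub-block
        h = self.norm2(x)
        L = x.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        return x + attn_out                              # attention sub-block

class CausalConvMixer(nn.Module):
    """Stand-in mixer (a causal depthwise convolution), NOT an SSM; used only
    so this sketch runs without the mamba_ssm package."""
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        x = F.pad(x.transpose(1, 2), (self.pad, 0))      # left-pad time axis for causality
        return self.conv(x).transpose(1, 2)

block = HybridBlock(d_model=64, n_heads=4, mixer=CausalConvMixer(64))
print(block(torch.randn(2, 16, 64)).shape)               # torch.Size([2, 16, 64])
```

Stacking several such blocks gives a model that retains the recurrent, sub-quadratic mixing of the SSM path while the attention path supplies the retrieval behavior the abstract reports as a weakness of pure SSMs.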