Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
June 5, 2024
Authors: Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung
cs.AI
Abstract
Transformers have rapidly become the preferred choice for audio
classification, surpassing CNN-based methods. However, Audio Spectrogram
Transformers (ASTs) exhibit quadratic scaling due to self-attention. Removing
this quadratic self-attention cost presents an appealing direction. Recently,
state space models (SSMs), such as Mamba, have demonstrated potential in
language and vision tasks in this regard. In this study, we explore whether
reliance on self-attention is necessary for audio classification tasks. We
address this question by introducing Audio Mamba (AuM), the first
self-attention-free, purely SSM-based model for audio classification. We
evaluate AuM on six different audio benchmark datasets, where it achieves
performance comparable to or better than the well-established AST model.
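The abstract does not include code, so the following is a minimal, hypothetical PyTorch sketch of the core idea it describes: replacing quadratic self-attention with two linear-time state-space scans over the patch sequence, one forward and one backward. The diagonal recurrence, module names, and dimensions below are simplifying assumptions for illustration, not the authors' actual AuM implementation (which builds on the full Mamba selective-scan block).

```python
# Hypothetical sketch of a bidirectional SSM block for spectrogram patches.
# The scan is a simplified diagonal linear recurrence h_t = a*h_{t-1} + b*x_t,
# standing in for Mamba's selective scan; cost is O(L) in sequence length,
# versus O(L^2) for self-attention.
import torch
import torch.nn as nn

class SimpleSSMScan(nn.Module):
    """Toy per-channel linear state-space recurrence over a token sequence."""
    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))  # per-channel decay (pre-sigmoid)
        self.b = nn.Parameter(torch.ones(dim))       # per-channel input gain
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                            # x: (batch, seq_len, dim)
        a = torch.sigmoid(self.log_a)                # keep the recurrence stable in (0, 1)
        h = torch.zeros_like(x[:, 0])                # initial state, (batch, dim)
        outs = []
        for t in range(x.size(1)):                   # sequential O(L) scan
            h = a * h + self.b * x[:, t]
            outs.append(h)
        return self.out(torch.stack(outs, dim=1))

class BidirectionalSSMBlock(nn.Module):
    """Fuses a forward scan and a backward scan, with a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = SimpleSSMScan(dim)
        self.bwd = SimpleSSMScan(dim)

    def forward(self, x):
        residual = x
        x = self.norm(x)
        y_fwd = self.fwd(x)
        y_bwd = self.bwd(x.flip(1)).flip(1)          # scan tokens right-to-left
        return residual + y_fwd + y_bwd

# Toy usage: a sequence of flattened spectrogram patch embeddings.
tokens = torch.randn(2, 100, 192)                    # (batch, num_patches, embed_dim)
block = BidirectionalSSMBlock(192)
print(block(tokens).shape)                           # torch.Size([2, 100, 192])
```

The bidirectional pairing matters because a single causal scan only lets each patch see earlier patches; audio classification benefits from context in both directions, which self-attention provides for free and which the backward scan restores here.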