

Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

June 5, 2024
Authors: Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung
cs.AI

Abstract

Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs. However, Audio Spectrogram Transformers (ASTs) exhibit quadratic scaling due to self-attention. The removal of this quadratic self-attention cost presents an appealing direction. Recently, state space models (SSMs), such as Mamba, have demonstrated potential in language and vision tasks in this regard. In this study, we explore whether reliance on self-attention is necessary for audio classification tasks. By introducing Audio Mamba (AuM), the first self-attention-free, purely SSM-based model for audio classification, we aim to address this question. We evaluate AuM on various audio datasets - comprising six different benchmarks - where it achieves performance comparable to or better than well-established AST models.
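
The core idea can be illustrated with a toy bidirectional state space layer. The sketch below is not the authors' implementation: real Mamba blocks add input-dependent (selective) parameters, a local convolution, gating, and a hardware-aware parallel scan, and AuM's exact block design is given in the paper. All module names, dimensions, and the diagonal recurrence here are simplified assumptions, meant only to show why an SSM scan over a patch sequence costs O(L) in sequence length, versus O(L^2) for self-attention.

```python
# Minimal sketch (assumed, not the authors' code): a diagonal linear SSM
# run bidirectionally over audio spectrogram patch embeddings.
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = <c, h_t>."""

    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        # Learnable diagonal state transition, kept stable via sigmoid.
        self.log_a = nn.Parameter(torch.randn(dim, state))
        self.b = nn.Parameter(torch.randn(dim, state) * 0.1)
        self.c = nn.Parameter(torch.randn(dim, state) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim)
        bsz, length, dim = x.shape
        a = torch.sigmoid(self.log_a)              # (dim, state), in (0, 1)
        h = x.new_zeros(bsz, dim, self.b.shape[1])
        ys = []
        for t in range(length):                    # O(L) sequential scan
            h = a * h + self.b * x[:, t].unsqueeze(-1)
            ys.append((h * self.c).sum(-1))        # project state to output
        return torch.stack(ys, dim=1)              # (batch, length, dim)


class BidirectionalSSMBlock(nn.Module):
    """Runs the scan in both directions so every patch sees full context."""

    def __init__(self, dim: int):
        super().__init__()
        self.fwd = SimpleSSM(dim)
        self.bwd = SimpleSSM(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.norm(x)
        # Forward scan plus a scan over the time-reversed sequence.
        out = self.fwd(z) + self.bwd(z.flip(1)).flip(1)
        return x + out                              # residual connection


# Usage with hypothetical sizes: 64 spectrogram patches, 192-dim embeddings.
tokens = torch.randn(2, 64, 192)
print(BidirectionalSSMBlock(192)(tokens).shape)     # torch.Size([2, 64, 192])
```

The bidirectional pairing of scans is what replaces self-attention's global receptive field: a single left-to-right recurrence would leave each patch blind to later frames, which matters for classifying whole audio clips.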

