Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis

Abstract

It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing the two in both performance and efficiency across multiple speech-related tasks. To make this comparison, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compare them with transformers of similar sizes in performance, memory, and speed. Our Mamba or Mamba-transformer hybrid models show comparable or higher performance than their transformer counterparts: Sepformer, Conformer, and VALL-E. They are more efficient than transformers in memory and speed for speech longer than a threshold duration, which is inversely related to the resolution of a speech token. Mamba for separation is the most efficient, and Mamba for recognition is the least. Further, we show that Mamba is not more efficient than transformers for speech shorter than the threshold duration, and that it performs worse in models that require joint modeling of text and speech, such as cross or masked attention of two inputs. Therefore, we argue that the superiority of Mamba or transformer depends on the particular problem and model. Code is available at https://github.com/xi-j/Mamba-TasNet and https://github.com/xi-j/Mamba-ASR.
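
The inverse relation between threshold duration and token resolution can be illustrated with a minimal sketch. It assumes, purely for illustration, that the crossover happens at a fixed sequence length in tokens; the function name `threshold_duration_s`, the constant `CROSSOVER_TOKENS`, and the token rates are hypothetical and are not taken from the paper.

```python
def threshold_duration_s(crossover_tokens: int, tokens_per_second: float) -> float:
    """Audio duration (seconds) at which a sequence reaches `crossover_tokens` tokens.

    Assumption: efficiency crosses over at a fixed token count, so the
    duration scales as 1 / tokens_per_second.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return crossover_tokens / tokens_per_second


# Hypothetical values for illustration only; none are measurements from the paper.
CROSSOVER_TOKENS = 1000
for rate in (50.0, 100.0, 1000.0):
    duration = threshold_duration_s(CROSSOVER_TOKENS, rate)
    print(f"{rate:7.1f} tokens/s -> threshold {duration:.2f} s")
```

Under this assumption, a model whose tokens each cover a shorter span of audio (more tokens per second) reaches the crossover length at a shorter audio duration.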
