Representing Speech Through Autoregressive Prediction of Cochlear Tokens

Abstract

We introduce AuriStream, a biologically inspired model that encodes speech through a two-stage framework modeled on the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations and achieves state-of-the-art lexical semantics. It also shows competitive performance on diverse downstream SUPERB speech tasks. Beyond its strong representational capabilities, AuriStream generates continuations of audio that can be visualized in spectrogram space and decoded back into audio, providing insight into the model's predictions. In summary, we present a two-stage framework for speech representation learning that aims to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
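The data flow through the two stages can be illustrated with a toy sketch. This is not the AuriStream implementation, and every name in it is hypothetical: log-spaced STFT band energies stand in for the cochlear representation, a nearest-neighbor codebook stands in for the tokenizer, and a smoothed bigram table stands in for the autoregressive sequence model.

```python
import numpy as np

def toy_cochleagram(wave, n_fft=256, hop=128, n_bands=16):
    """Stand-in for a cochlear time-frequency representation (assumption):
    log-compressed energies in roughly log-spaced frequency bands."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Log-spaced bin edges; duplicates are dropped, so the number of
    # bands can end up slightly below n_bands.
    edges = np.unique(np.geomspace(1, power.shape[1], n_bands + 1).astype(int))
    bands = np.stack([power[:, lo:hi].sum(axis=1)
                      for lo, hi in zip(edges[:-1], edges[1:])], axis=1)
    return np.log1p(bands)

def quantize(features, codebook):
    """Stage 1 output: map each frame to the index of its nearest codebook row."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def fit_bigram(tokens, vocab_size):
    """Stage 2 stand-in: next-token probabilities from add-one-smoothed counts."""
    counts = np.ones((vocab_size, vocab_size))
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def continue_tokens(probs, prompt, n_new):
    """Greedy autoregressive continuation: each new token depends on the last."""
    out = [int(t) for t in prompt]
    for _ in range(n_new):
        out.append(int(probs[out[-1]].argmax()))
    return out

# Synthetic input: one second of a noisy 440 Hz tone at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
wave = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(t.size)

feats = toy_cochleagram(wave)
codebook = feats[rng.choice(len(feats), size=8, replace=False)]
tokens = quantize(feats, codebook)
probs = fit_bigram(tokens, vocab_size=8)
continuation = continue_tokens(probs, tokens[:5], n_new=10)

# Looking up predicted tokens in the codebook maps them back to
# spectrogram-like frames, which is what makes predictions inspectable.
predicted_frames = codebook[np.array(continuation)]
```

Because the predicted tokens index directly into the codebook, the continuation can be viewed as a sequence of time-frequency frames, which mirrors the abstract's point that predictions can be visualized in spectrogram space.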
