Representing Speech Through Autoregressive Prediction of Cochlear Tokens
August 15, 2025
Authors: Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel L. K. Yamins
cs.AI
Abstract
We introduce AuriStream, a biologically inspired model that encodes speech through a two-stage framework modeled on the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, as well as state-of-the-art lexical semantics, and shows competitive performance on diverse downstream SUPERB speech tasks. Complementing these strong representational capabilities, AuriStream generates audio continuations that can be visualized in spectrogram space and decoded back into audio, providing insight into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
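
To make the two-stage design concrete, the following is a minimal PyTorch sketch of the pipeline described in the abstract: a filterbank front end whose time-frequency frames are quantized against a codebook to produce discrete cochlear tokens, followed by a causal Transformer that autoregressively predicts the next token and can roll out continuations. Every name, dimension, and design detail here is an assumption for illustration (including the mel spectrogram standing in for the cochlear model and the nearest-neighbor quantization); this is not AuriStream's actual implementation.

```python
import torch
import torch.nn as nn
import torchaudio


class CochlearTokenizer(nn.Module):
    """Stage 1 (sketch): map raw audio to a time-frequency representation
    and assign each frame to its nearest codebook entry, yielding discrete
    token ids. A mel filterbank is a stand-in for the cochlear front end."""

    def __init__(self, n_filters=64, codebook_size=1024, sample_rate=16000):
        super().__init__()
        self.frontend = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=n_filters)
        self.codebook = nn.Embedding(codebook_size, n_filters)

    def forward(self, waveform):                       # (batch, samples)
        spec = self.frontend(waveform)                 # (batch, n_filters, frames)
        frames = spec.transpose(1, 2)                  # (batch, frames, n_filters)
        # Squared Euclidean distance of every frame to every codebook vector.
        dists = (frames.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(dim=-1)                    # (batch, frames) token ids


class AuriStreamLM(nn.Module):
    """Stage 2 (sketch): a causal Transformer trained to predict the next
    cochlear token, analogous to a language model over audio tokens."""

    def __init__(self, codebook_size=1024, d_model=256, n_heads=4,
                 n_layers=4, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, tokens):                         # (batch, seq)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.encoder(x, mask=mask))   # next-token logits


# Toy usage: tokenize one second of (random) audio, then greedily extend the
# token sequence, mirroring the paper's audio-continuation capability.
tokenizer, lm = CochlearTokenizer(), AuriStreamLM()
tokens = tokenizer(torch.randn(1, 16000))
with torch.no_grad():
    for _ in range(10):
        next_tok = lm(tokens)[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
```

In the system the abstract describes, predicted tokens would additionally be mapped back to spectrogram space and decoded into audio; the greedy loop above only illustrates the autoregressive rollout step.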