
Representing Speech Through Autoregressive Prediction of Cochlear Tokens

August 15, 2025
Authors: Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel L. K. Yamins
cs.AI

Abstract

We introduce AuriStream, a biologically inspired model that encodes speech via a two-stage framework modeled on the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations and state-of-the-art lexical semantics, and shows competitive performance on diverse downstream SUPERB speech tasks. Complementing these strong representational capabilities, AuriStream generates audio continuations that can be visualized in spectrogram space and decoded back into audio, providing insight into the model's predictions. In summary, we present a two-stage framework for speech representation learning that advances the development of more human-like models able to handle a range of speech-based tasks efficiently.
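
To make the two-stage design concrete, below is a minimal PyTorch sketch of the pipeline under stated assumptions: the cochlear filterbank is stood in for by a strided 1-D convolution, the tokenizer by a nearest-neighbor codebook lookup, and the sequence model by a causal Transformer. All names (`CochlearTokenizer`, `AuriStreamLM`), dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation; real codebook training would additionally need a quantization objective (e.g., a straight-through estimator).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CochlearTokenizer(nn.Module):
    """Stage 1 (assumed form): map raw audio to a cochlea-like
    time-frequency representation, then quantize each frame to a
    discrete token via nearest-neighbor lookup in a learned codebook."""

    def __init__(self, n_channels=64, codebook_size=1024):
        super().__init__()
        # Stand-in for a cochlear filterbank: a strided 1-D convolution
        # yielding n_channels frequency-like channels per time frame.
        self.filterbank = nn.Conv1d(1, n_channels, kernel_size=400, stride=160)
        self.codebook = nn.Embedding(codebook_size, n_channels)

    def forward(self, audio):                         # audio: (B, samples)
        frames = self.filterbank(audio.unsqueeze(1))  # (B, C, T)
        frames = frames.transpose(1, 2)               # (B, T, C)
        # Nearest codebook entry per frame -> discrete cochlear tokens.
        dists = torch.cdist(frames, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)                   # (B, T) token ids


class AuriStreamLM(nn.Module):
    """Stage 2 (assumed form): a causal Transformer over cochlear
    tokens, trained with next-token prediction."""

    def __init__(self, vocab_size=1024, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                        # tokens: (B, T)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)                           # (B, T, vocab) logits


# Usage: tokenize audio, then train the LM with next-token cross-entropy.
tokenizer, lm = CochlearTokenizer(), AuriStreamLM()
tokens = tokenizer(torch.randn(2, 16000))             # 1 s of 16 kHz audio
logits = lm(tokens[:, :-1])                           # predict token t+1 from <= t
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
```

The sketch's main point is the division of labor: stage 1 fixes a discrete, cochlea-like token vocabulary, after which stage 2 is an ordinary next-token sequence model; this autoregressive formulation is what allows the model to roll out audio continuations token by token.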