Representing Speech Through Autoregressive Prediction of Cochlear Tokens

Abstract

We introduce AuriStream, a biologically inspired model that encodes speech via a two-stage framework modeled on the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations and achieves state-of-the-art lexical semantics. It also shows competitive performance on diverse downstream SUPERB speech tasks. Beyond its strong representational capabilities, AuriStream generates continuations of audio that can be visualized in spectrogram space and decoded back into audio, providing insight into the model's predictions. In summary, we present a two-stage framework for speech representation learning that advances the development of more human-like models that efficiently handle a range of speech-based tasks.
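
To make the two-stage structure concrete, here is a minimal toy sketch in Python. It is not the AuriStream implementation: the abstract does not specify the cochlear transform, the tokenizer, or the sequence model, so every component below is a stand-in chosen for brevity. A naive DFT with coarse frequency bands replaces the cochlea-based time-frequency representation, an argmax over bands replaces the discrete cochlear tokenizer, and a bigram count table replaces the autoregressive sequence model. All function names and parameters (`band_energies`, `tokenize`, `fit_bigram`, `continue_tokens`, `frame_len`, `n_bands`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def band_energies(samples, frame_len=64, n_bands=4):
    """Stage 1a (stand-in): log energy per coarse frequency band, per frame.

    A real system would use a cochlea-based time-frequency representation;
    a naive DFT over non-overlapping frames is used here only for illustration.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = []
        for k in range(1, frame_len // 2 + 1):  # skip the DC bin
            re = sum(x * math.cos(2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(frame))
            im = sum(x * math.sin(2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(frame))
            power.append(re * re + im * im)
        per_band = len(power) // n_bands
        frames.append([math.log1p(sum(power[b * per_band:(b + 1) * per_band]))
                       for b in range(n_bands)])
    return frames


def tokenize(frames):
    """Stage 1b (stand-in): one discrete token per frame (index of loudest band)."""
    return [max(range(len(f)), key=lambda b: f[b]) for f in frames]


def fit_bigram(tokens):
    """Stage 2 (stand-in): count next-token statistics, a toy autoregressive model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_tokens(counts, prompt, n_steps):
    """Greedily extend a token prompt with the most frequent next token."""
    out = list(prompt)
    for _ in range(n_steps):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this token
        out.append(followers.most_common(1)[0][0])
    return out


def toy_signal(n_frames=8, frame_len=64):
    """Alternate a low tone (DFT bin 4) and a high tone (DFT bin 28) per frame."""
    samples = []
    for i in range(n_frames):
        k = 4 if i % 2 == 0 else 28
        samples += [math.sin(2 * math.pi * k * n / frame_len)
                    for n in range(frame_len)]
    return samples
```

A possible usage: `tokens = tokenize(band_energies(toy_signal()))` yields an alternating token sequence, and `continue_tokens(fit_bigram(tokens), tokens[:1], 4)` extends a one-token prompt by following the learned alternation. In the framework described above, the predicted tokens would additionally be mapped back to a spectrogram-like representation and decoded to audio; that step is omitted here.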

