

DM-Codec: Distilling Multimodal Representations for Speech Tokenization

October 19, 2024
作者: Md Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, A K M Mahbubur Rahman, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Md Mofijul Islam, Amin Ahsan Ali
cs.AI

Abstract

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. The code, samples, and model checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.
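To make the described setup concrete, below is a minimal, self-contained PyTorch sketch of the general idea: an encoder-decoder codec with a Residual Vector Quantizer whose quantized representations are additionally guided, via distillation losses, by hidden states from a language model (contextual) and a self-supervised speech model (semantic). The module names, dimensions, frame alignment, and the choice of L1 distillation losses are illustrative assumptions made for this sketch, not the paper's implementation; the authors' actual code is available at the linked repository.

# Minimal sketch (not the authors' implementation) of LM- and SM-guided
# distillation into an RVQ-based codec. Names, sizes, and loss choices are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualVQ(nn.Module):
    """Simplified residual vector quantizer with a straight-through estimator."""
    def __init__(self, num_quantizers=8, codebook_size=1024, dim=256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    def forward(self, z):                                   # z: (batch, time, dim)
        residual, quantized, layer_outputs = z, 0.0, []
        for codebook in self.codebooks:
            # Nearest codebook entry for the current residual.
            dists = torch.cdist(residual, codebook.weight.unsqueeze(0))
            codes = dists.argmin(dim=-1)
            q = codebook(codes)
            quantized = quantized + q
            residual = residual - q.detach()
            layer_outputs.append(quantized)                 # cumulative sum per RVQ layer
        # Straight-through: gradients of the reconstruction loss reach the encoder via z.
        quantized = z + (quantized - z).detach()
        return quantized, layer_outputs

class CodecSketch(nn.Module):
    def __init__(self, dim=256, lm_dim=768, sm_dim=768):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=320, stride=320)       # toy downsampler
        self.rvq = ResidualVQ(dim=dim)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=320, stride=320)
        # Projections from quantized features to the teachers' hidden sizes.
        self.to_lm = nn.Linear(dim, lm_dim)
        self.to_sm = nn.Linear(dim, sm_dim)

    def forward(self, wav):                                 # wav: (batch, 1, samples)
        z = self.encoder(wav).transpose(1, 2)               # (batch, frames, dim)
        quantized, layers = self.rvq(z)
        recon = self.decoder(quantized.transpose(1, 2))
        return recon, layers

def distillation_loss(layers, lm_hidden, sm_hidden, model):
    """Match averaged RVQ-layer representations to LM and SM hidden states."""
    pooled = torch.stack(layers).mean(dim=0)                # average over RVQ layers
    lm_loss = F.l1_loss(model.to_lm(pooled), lm_hidden)     # contextual guidance
    sm_loss = F.l1_loss(model.to_sm(pooled), sm_hidden)     # semantic guidance
    return lm_loss + sm_loss

# Toy usage with random tensors standing in for real audio and teacher outputs.
model = CodecSketch()
wav = torch.randn(2, 1, 16000)                              # one second at 16 kHz
recon, layers = model(wav)
frames = layers[0].shape[1]
lm_hidden = torch.randn(2, frames, 768)                     # stand-in for LM states aligned to frames
sm_hidden = torch.randn(2, frames, 768)                     # stand-in for speech-SSL states
loss = F.l1_loss(recon, wav) + distillation_loss(layers, lm_hidden, sm_hidden, model)
loss.backward()

In an actual training run, the LM and SM teachers would supply hidden states computed from transcriptions and audio (for example, a BERT-style language model and a HuBERT/wav2vec 2.0-style speech model); the random tensors above merely stand in for those outputs, and the sketch omits the adversarial and commitment losses a full neural codec would normally include.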
