DM-Codec: Distilling Multimodal Representations for Speech Tokenization
October 19, 2024
Authors: Md Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, A K M Mahbubur Rahman, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Md Mofijul Islam, Amin Ahsan Ali
cs.AI
Abstract
Recent advancements in speech-language models have yielded significant
improvements in speech tokenization and synthesis. However, effectively mapping
the complex, multidimensional attributes of speech into discrete tokens remains
challenging. This process demands acoustic, semantic, and contextual
information for precise speech representations. Existing speech representations
generally fall into two categories: acoustic tokens from audio codecs and
semantic tokens from speech self-supervised learning models. Although recent
efforts have unified acoustic and semantic tokens for improved performance,
they overlook the crucial role of contextual representation in comprehensive
speech modeling. Our empirical investigations reveal that the absence of
contextual representations results in elevated Word Error Rate (WER) and Word
Information Lost (WIL) scores in speech transcriptions. To address these
limitations, we propose two novel distillation approaches: (1) a language model
(LM)-guided distillation method that incorporates contextual information, and
(2) a combined LM and self-supervised speech model (SM)-guided distillation
technique that effectively distills multimodal representations (acoustic,
semantic, and contextual) into a comprehensive speech tokenizer, termed
DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder
framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM
during the training process. Experiments show DM-Codec significantly
outperforms state-of-the-art speech tokenization models, reducing WER by up to
13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility
by 1.85% on the LibriSpeech benchmark dataset. The code, samples, and model
checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.
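The abstract describes distilling LM (contextual) and SM (semantic) teacher representations into the codec's quantized features, but does not specify the loss here. A minimal illustrative sketch of one plausible combined feature-distillation objective is below; the cosine-distance form, function names, and weights `w_lm`/`w_sm` are assumptions for illustration, not the paper's actual formulation (teacher features are also assumed to be already projected to the student's frame rate and dimension):

```python
import numpy as np

def cosine_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between student frames (T, D) and
    aligned teacher hidden states (T, D). A common feature-level
    distillation loss; assumed here, not taken from the paper."""
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + 1e-8)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def combined_distill_loss(codec_feats: np.ndarray,
                          lm_hidden: np.ndarray,
                          sm_hidden: np.ndarray,
                          w_lm: float = 1.0,
                          w_sm: float = 1.0) -> float:
    """Weighted sum of LM-guided (contextual) and SM-guided (semantic)
    distillation terms; weights are hypothetical hyperparameters."""
    return (w_lm * cosine_distill_loss(codec_feats, lm_hidden)
            + w_sm * cosine_distill_loss(codec_feats, sm_hidden))
```

In training, this distillation term would be added to the usual codec objectives (reconstruction, RVQ commitment, adversarial losses), so the quantized tokens carry acoustic, semantic, and contextual information simultaneously.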