DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Abstract

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. The code, samples, and model checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.
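The Residual Vector Quantizer (RVQ) named in the abstract can be illustrated with a small sketch. This is a generic, pure-Python illustration of residual quantization, not DM-Codec's implementation: the function names (`nearest_index`, `rvq_encode`, `rvq_decode`) and the toy two-stage codebooks are hypothetical, and a real codec would learn its codebooks during training rather than hard-code them.

```python
def nearest_index(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        indices.append(idx)
        # Pass on only what this stage failed to capture.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (assumed values, chosen only for illustration).
coarse = [[0.0, 0.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.5, -0.5]]

tokens = rvq_encode([1.4, 0.6], [coarse, fine])
reconstruction = rvq_decode(tokens, [coarse, fine])
print(tokens, reconstruction)
```

In this sketch each input frame becomes one discrete index per stage, so later stages refine what earlier stages left over. How DM-Codec uses the LM and SM to guide distillation into these quantized layers is not specified in the abstract and is not modelled here.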
