TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

Abstract

Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations: 1) dependence on multi-layer residual vector quantization structures or high frame rates, 2) reliance on auxiliary pre-trained models for semantic distillation, and 3) the need for complex two-stage training processes. In this work, we introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach designed to overcome these challenges. TaDiCodec optimizes quantization and reconstruction end to end through a diffusion autoencoder and integrates text guidance into the diffusion decoder to improve reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS). Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. We also validate the compatibility of TaDiCodec with language-model-based zero-shot text-to-speech using both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling, as well as a very small reconstruction-generation gap. Audio samples are available at https://tadicodec.github.io/. Code and model checkpoints are released at https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.
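
The frame rate and bitrate quoted in the abstract can be related with a short calculation. The sketch below assumes one token per frame from the single-layer codebook and fixed-length coding, so that bitrate = frame rate × bits per token. The implied codebook size of 2^14 = 16384 entries is an inference from the two reported figures under that assumption, not a number stated in the abstract, and the variable names are illustrative.

```python
# Figures reported in the abstract.
SAMPLE_RATE_HZ = 24_000   # 24 kHz speech
FRAME_RATE_HZ = 6.25      # tokens per second
BITRATE_BPS = 87.5        # 0.0875 kbps expressed in bits per second


def bits_per_token(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Bits carried by each token, assuming one token per frame."""
    return bitrate_bps / frame_rate_hz


bits = bits_per_token(BITRATE_BPS, FRAME_RATE_HZ)

# Assumption: fixed-length codes, so the codebook has 2**bits entries.
implied_codebook_size = 2 ** round(bits)

# Number of waveform samples summarized by a single token.
samples_per_token = SAMPLE_RATE_HZ / FRAME_RATE_HZ

print(f"bits per token:        {bits}")
print(f"implied codebook size: {implied_codebook_size}")
print(f"samples per token:     {samples_per_token}")
```

Under these assumptions each token carries 14 bits and covers 3840 waveform samples (160 ms of audio), which illustrates how aggressive the compression is compared with multi-layer or high-frame-rate designs.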
