TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
August 22, 2025
Authors: Yuancheng Wang, Dekun Chen, Xueyao Zhang, Junan Zhang, Jiaqi Li, Zhizheng Wu
cs.AI
Abstract
Speech tokenizers serve as foundational components for speech language
models, yet current designs exhibit several limitations, including: 1)
dependence on multi-layer residual vector quantization structures or high frame
rates, 2) reliance on auxiliary pre-trained models for semantic distillation,
and 3) requirements for complex two-stage training processes. In this work, we
introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a
novel approach designed to overcome these challenges. TaDiCodec employs
end-to-end optimization for quantization and reconstruction through a diffusion
autoencoder, while integrating text guidance into the diffusion decoder to
enhance reconstruction quality and achieve optimal compression. TaDiCodec
achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of
0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining
superior performance on critical speech generation evaluation metrics such as
Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS).
Notably, TaDiCodec employs a single-stage, end-to-end training paradigm that
obviates the need for auxiliary pre-trained models. We also validate the
compatibility of TaDiCodec in language model based zero-shot text-to-speech
with both autoregressive modeling and masked generative modeling, demonstrating
its effectiveness and efficiency for speech language modeling, as well as a
notably small reconstruction-generation gap. We open-source our code and
model checkpoints. Audio samples are available at
https://tadicodec.github.io/, and code and model checkpoints are released at
https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.
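As a quick sanity check on the figures above: 0.0875 kbps at a 6.25 Hz frame rate works out to 14 bits per token, which would correspond to a single-layer codebook of 2^14 = 16384 entries. The codebook size is an inference from the stated numbers, not a figure given in this abstract:

```python
# Back-of-envelope check of TaDiCodec's compression figures as stated
# in the abstract. The codebook size (2**14 = 16384) is inferred from
# the bitrate and frame rate, not confirmed by this page.

sample_rate_hz = 24_000   # input speech sampling rate
frame_rate_hz = 6.25      # tokens emitted per second
bitrate_kbps = 0.0875     # stated bitrate with a single-layer codebook

# Bits carried by each token at the stated bitrate.
bits_per_token = bitrate_kbps * 1000 / frame_rate_hz
print(bits_per_token)          # 14.0

# A 14-bit token implies a codebook of 2**14 entries (inference).
inferred_codebook_size = 2 ** round(bits_per_token)
print(inferred_codebook_size)  # 16384

# Each token summarizes this many raw audio samples.
samples_per_token = sample_rate_hz / frame_rate_hz
print(samples_per_token)       # 3840.0
```

At roughly one token per 160 ms of audio, this frame rate is far lower than typical neural codecs, which is what makes the single-layer codebook and the 0.0875 kbps bitrate possible.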