

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

August 22, 2025
Authors: Yuancheng Wang, Dekun Chen, Xueyao Zhang, Junan Zhang, Jiaqi Li, Zhizheng Wu
cs.AI

Abstract

Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations, including: 1) dependence on multi-layer residual vector quantization structures or high frame rates, 2) reliance on auxiliary pre-trained models for semantic distillation, and 3) requirements for complex two-stage training processes. In this work, we introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach designed to overcome these challenges. TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS). Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. We also validate the compatibility of TaDiCodec in language-model-based zero-shot text-to-speech with both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling, as well as a remarkably small reconstruction-generation gap. We will open-source our code and model checkpoints. Audio samples are available at https://tadicodec.github.io/. We release code and model checkpoints at https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.
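The stated frame rate and bitrate pin down the token payload: at 6.25 tokens per second, 0.0875 kbps implies 14 bits per token, i.e. a single codebook of 2^14 entries. The sketch below is a back-of-the-envelope check of that arithmetic; the codebook size is inferred from the abstract's numbers, not stated in it.

```python
import math

def tokenizer_bitrate_kbps(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook tokenizer: (frames/s) * (bits/frame) / 1000."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * bits_per_token / 1000.0

# 6.25 Hz * 14 bits = 87.5 bps = 0.0875 kbps, matching the abstract.
# The 2**14 codebook size is an inference from those two figures.
rate = tokenizer_bitrate_kbps(6.25, 2**14)
```

For comparison, a typical multi-layer RVC codec at 50-75 Hz with several codebooks lands in the multiple-kbps range, which is why the single-codebook 6.25 Hz design yields such a large compression gain.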