

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

September 14, 2025
作者: Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman
cs.AI

Abstract
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs primarily capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts have introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcast representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
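The three techniques above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the fusion rule, the MSE stand-in losses, the window size, and all tensor names here are illustrative assumptions; the actual model operates on encoder latents and quantized tokens with its own loss formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16  # frames, feature dim (toy sizes)

# Toy stand-ins for the three representation streams:
acoustic = rng.normal(size=(T, d))   # codec encoder latent
semantic = rng.normal(size=(T, d))   # e.g. self-supervised speech features
context = rng.normal(size=(T, d))    # e.g. pre-trained LM features

# (i) Latent Representation Fusion: inject semantic/contextual features
# directly into the encoder latent space (here: a simple weighted sum;
# the weights are illustrative, not the paper's).
def fuse(a, s, c, ws=0.5, wc=0.5):
    return a + ws * s + wc * c

fused = fuse(acoustic, semantic, context)

# (ii) Global Semantic-Contextual Supervision: pool a guidance stream
# over time, broadcast it back to every frame, and penalize the distance
# between the latent and the broadcast global vector (MSE as stand-in).
def global_supervision_loss(z, guide):
    g = guide.mean(axis=0, keepdims=True)       # global pooling
    g = np.broadcast_to(g, z.shape)             # broadcast to all frames
    return float(np.mean((z - g) ** 2))

loss_global = (global_supervision_loss(fused, semantic)
               + global_supervision_loss(fused, context))

# (iii) Temporally Aligned Contextual Supervision: for each speech frame,
# dynamically pick the most similar contextual token inside a local
# window, then supervise against that match.
def aligned_loss(z, ctx, window=2):
    total = 0.0
    for t in range(len(z)):
        lo, hi = max(0, t - window), min(len(ctx), t + window + 1)
        cands = ctx[lo:hi]
        sims = cands @ z[t] / (
            np.linalg.norm(cands, axis=1) * np.linalg.norm(z[t]) + 1e-8)
        best = cands[np.argmax(sims)]           # local dynamic match
        total += float(np.mean((z[t] - best) ** 2))
    return total / len(z)

loss_aligned = aligned_loss(fused, context)
print(loss_global, loss_aligned)
```

In the paper these objectives supervise the discrete token stream jointly with the codec's reconstruction losses; here they are computed on raw arrays purely to show the pooling/broadcast and windowed-matching structure.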
PDF · September 17, 2025