ChatPaper.ai


FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

September 14, 2025
Authors: Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman
cs.AI

Abstract

Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcast representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. The results highlight the effectiveness of semantically and contextually guided tokenization for speech representation and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
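The two supervision ideas in the abstract can be illustrated with a minimal sketch: technique (ii) pools a representation over time and broadcasts it back to every frame as a global target, while technique (iii) matches each speech frame to its most similar contextual frame within a local window. All shapes, function names, and the cosine-similarity loss below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def _normalize(x):
    # Unit-normalize along the feature dimension (eps avoids division by zero).
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def global_supervision_target(features):
    # Mean-pool a (T, D) sequence over time, then broadcast the pooled
    # vector back to every frame, giving a globally informed (T, D) target.
    pooled = features.mean(axis=0, keepdims=True)    # (1, D)
    return np.broadcast_to(pooled, features.shape)   # (T, D)

def cosine_alignment_loss(tokens, target):
    # 1 - mean cosine similarity between token embeddings and the target;
    # lower values mean stronger cross-modal alignment.
    sims = (_normalize(tokens) * _normalize(target)).sum(axis=-1)
    return 1.0 - sims.mean()

def local_window_match(context, speech, window=2):
    # For each speech frame t, return the index of the most similar
    # context frame within [t - window, t + window] (clipped to bounds),
    # a simple stand-in for temporally aligned contextual supervision.
    c, s = _normalize(context), _normalize(speech)
    matches = []
    for t in range(len(s)):
        lo, hi = max(0, t - window), min(len(c), t + window + 1)
        sims = c[lo:hi] @ s[t]
        matches.append(lo + int(np.argmax(sims)))
    return matches
```

In practice the global target would come from pooled semantic or contextual encoder outputs and the loss would be backpropagated into the codec's discrete tokens; this sketch only shows the pooling, broadcasting, and windowed-matching mechanics.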