FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Abstract

Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features and overlook the semantic and contextual cues inherent to human speech. While recent efforts have introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. The results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
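To make techniques (ii) and (iii) more concrete, the following is a minimal pure-Python sketch of the two ideas as the abstract describes them: pooling guidance features over time and broadcasting them to every frame, and matching each speech frame to a contextual token inside a local window. It is not the authors' implementation. All function names, the cosine-based loss, the proportional window centre, and the assumption that speech and guidance features already share one dimension are illustrative assumptions; the released code at the URL above is the authoritative reference.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def global_pool_broadcast(features, num_frames):
    """Mean-pool a sequence of vectors over time, then repeat the result per frame."""
    dim = len(features[0])
    pooled = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return [pooled[:] for _ in range(num_frames)]


def global_supervision_loss(token_reprs, guide_feats):
    """Assumed loss: mean (1 - cosine) between each frame and the broadcast global target."""
    targets = global_pool_broadcast(guide_feats, len(token_reprs))
    total = sum(1.0 - cosine(z, g) for z, g in zip(token_reprs, targets))
    return total / len(token_reprs)


def windowed_match(speech_frames, context_tokens, window=1):
    """For each speech frame, return the index of the most similar contextual token
    within +/- `window` of the proportionally corresponding token position."""
    num_frames, num_tokens = len(speech_frames), len(context_tokens)
    matches = []
    for t, frame in enumerate(speech_frames):
        centre = min(num_tokens - 1, t * num_tokens // num_frames)
        lo = max(0, centre - window)
        hi = min(num_tokens, centre + window + 1)
        best = max(range(lo, hi), key=lambda j: cosine(frame, context_tokens[j]))
        matches.append(best)
    return matches


# Toy data: four 2-D speech frames and two 2-D contextual token embeddings.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
context = [[1.0, 0.0], [0.0, 1.0]]

matches = windowed_match(frames, context, window=1)   # [0, 0, 1, 1]
loss = global_supervision_loss(frames, context)       # scalar in (0, 1) for this data
```

In this toy setup the first two frames align with the first contextual token and the last two with the second, while the global loss compares every frame against one pooled vector. The contrast illustrates why the abstract pairs a global signal (temporal consistency) with a local, windowed one (fine-grained token-level supervision).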
