FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Abstract

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which enables natural speech generation with control over language, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency automatic speech recognition (ASR) for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels at multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction following. The SenseVoice and CosyVoice models have been open-sourced on Modelscope and Huggingface, and the corresponding training, inference, and fine-tuning code has been released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.



