FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
July 4, 2024
Authors: Tongyi SpeechTeam
cs.AI
Abstract
This report introduces FunAudioLLM, a model family designed to enhance
natural voice interactions between humans and large language models (LLMs). At
its core are two innovative models: SenseVoice, which handles multilingual
speech recognition, emotion recognition, and audio event detection; and
CosyVoice, which facilitates natural speech generation with control over
multiple languages, timbre, speaking style, and speaker identity.
SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and
SenseVoice-Large supports high-precision ASR for over 50 languages, while
CosyVoice excels in multi-lingual voice generation, zero-shot in-context
learning, cross-lingual voice cloning, and instruction-following capabilities.
The models related to SenseVoice and CosyVoice have been open-sourced on
Modelscope and Huggingface, with the corresponding training, inference,
and fine-tuning code released on GitHub. By integrating these models with
LLMs, FunAudioLLM enables applications such as speech-to-speech translation,
emotional voice chat, interactive podcasts, and expressive audiobook narration,
thereby pushing the boundaries of voice interaction technology. Demos are
available at https://fun-audio-llm.github.io, and the code can be accessed at
https://github.com/FunAudioLLM.