FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Abstract

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over language, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency automatic speech recognition (ASR) for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages. CosyVoice excels in multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction following. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, and the corresponding training, inference, and fine-tuning code has been released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.
