FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Abstract

This report introduces FunAudioLLM, a model family designed to enhance natural voice interaction between humans and large language models (LLMs). At its core are two models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which generates natural speech with control over language, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages. CosyVoice excels at multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction following. The SenseVoice and CosyVoice models have been open-sourced on Modelscope and Huggingface, and the corresponding training, inference, and fine-tuning code has been released on GitHub. Integrating these models with LLMs enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.
