Fish Audio S2 Technical Report
March 9, 2026
Authors: Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han
cs.AI
Abstract
We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We encourage readers to visit https://fish.audio to try custom voices.
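The two streaming metrics quoted above can be made concrete. A minimal sketch of how RTF and time-to-first-audio might be measured for any streaming TTS engine (the generator `fake_stream` below is a hypothetical stand-in for a real engine, not part of the released code):

```python
import time

def measure_streaming_metrics(stream, sample_rate):
    """Measure time-to-first-audio (TTFA) and real-time factor (RTF)
    for a streaming generator that yields chunks of audio samples."""
    start = time.perf_counter()
    ttfa = None
    total_samples = 0
    for chunk in stream:
        if ttfa is None:
            # Latency until the first audio chunk arrives.
            ttfa = time.perf_counter() - start
        total_samples += len(chunk)
    elapsed = time.perf_counter() - start
    audio_seconds = total_samples / sample_rate
    # RTF = wall-clock synthesis time / duration of generated audio;
    # values below 1.0 mean synthesis is faster than real time.
    rtf = elapsed / audio_seconds
    return ttfa, rtf

# Hypothetical stand-in engine: yields ten 0.1 s chunks with small delays.
def fake_stream(num_chunks=10, sr=24000):
    for _ in range(num_chunks):
        time.sleep(0.01)
        yield [0.0] * (sr // 10)

ttfa, rtf = measure_streaming_metrics(fake_stream(), sample_rate=24000)
print(f"TTFA: {ttfa * 1000:.1f} ms, RTF: {rtf:.3f}")
```

Under this definition, an RTF of 0.195 means one second of audio is synthesized in roughly 195 ms of wall-clock time, which is what makes sub-100 ms first-audio latency practical for interactive use.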