Fish Audio S2 Technical Report
March 9, 2026
Authors: Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han
cs.AI
Abstract
We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We encourage readers to visit https://fish.audio to try custom voices.
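The two streaming metrics quoted above can be made concrete. A minimal sketch of how RTF and time-to-first-audio might be measured for any streaming TTS engine (the generator `fake_stream` below is a hypothetical stand-in for a real engine, not part of the released code):

```python
import time

def measure_streaming_metrics(stream, sample_rate):
    """Measure time-to-first-audio (TTFA) and real-time factor (RTF)
    for a streaming generator that yields chunks of audio samples."""
    start = time.perf_counter()
    ttfa = None
    total_samples = 0
    for chunk in stream:
        if ttfa is None:
            # Latency until the first audio chunk arrives.
            ttfa = time.perf_counter() - start
        total_samples += len(chunk)
    elapsed = time.perf_counter() - start
    audio_seconds = total_samples / sample_rate
    # RTF = wall-clock synthesis time / duration of generated audio;
    # values below 1.0 mean synthesis is faster than real time.
    rtf = elapsed / audio_seconds
    return ttfa, rtf

# Hypothetical stand-in engine: yields ten 0.1 s chunks with small delays.
def fake_stream(num_chunks=10, sr=24000):
    for _ in range(num_chunks):
        time.sleep(0.01)
        yield [0.0] * (sr // 10)

ttfa, rtf = measure_streaming_metrics(fake_stream(), sample_rate=24000)
print(f"TTFA: {ttfa * 1000:.1f} ms, RTF: {rtf:.3f}")
```

Under this definition, an RTF of 0.195 means one second of audio is synthesized in roughly 195 ms of wall-clock time, which is what makes sub-100 ms first-audio latency practical for interactive use.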