Fish Audio S2 Technical Report

Abstract

We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We encourage readers to visit https://fish.audio to try custom voices.
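The abstract quotes the RTF without defining it. A minimal sketch of the arithmetic follows, assuming the conventional definition (time spent generating divided by the duration of the generated audio). The function name and the timing values are hypothetical and chosen only to reproduce the reported figure; they are not measurements from the report.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: compute time divided by duration of the audio produced.

    Assumption: the report uses this standard definition; the abstract
    itself does not spell it out.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings: 1.95 s of compute to produce 10 s of audio.
rtf = real_time_factor(1.95, 10.0)
speedup = 1.0 / rtf  # how many seconds of audio are produced per second of compute

print(f"RTF = {rtf:.3f}, about {speedup:.2f}x faster than real time")
```

Under this definition, any RTF below 1 means audio is generated faster than it plays back, which is the precondition for uninterrupted streaming.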
