Fish Audio S2 Technical Report

Abstract

We introduce Fish Audio S2, an open-source text-to-speech (TTS) system featuring multi-speaker and multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
