MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

May 12, 2025
Authors: Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He
cs.AI

Abstract

We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the proposed Flow-VAE improves the overall quality of the synthesized audio. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, made possible by the robust and disentangled representations from the speaker encoder, is that it can be extended without modifying the base model. This enables applications such as arbitrary voice emotion control via LoRA; text to voice (T2V), which synthesizes timbre features directly from a text description; and professional voice cloning (PVC), which fine-tunes timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
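
The two objective voice cloning metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say which speech recognizer or speaker-verification model produces the transcripts and embeddings, so the version below assumes both are already available. The function names are hypothetical, and scoring speaker similarity as the cosine similarity of two embedding vectors is a common convention assumed here, not a detail confirmed by the abstract.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def speaker_similarity(embedding_a, embedding_b) -> float:
    """Cosine similarity between two speaker embeddings (assumed metric)."""
    if len(embedding_a) != len(embedding_b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(embedding_a, embedding_b))
    norm_a = math.sqrt(sum(x * x for x in embedding_a))
    norm_b = math.sqrt(sum(y * y for y in embedding_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

A lower word error rate means the synthesized speech is transcribed more accurately, while a speaker similarity closer to 1 means the cloned voice is closer to the reference under the chosen embedding model.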
