MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

May 12, 2025
Authors: Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He
cs.AI

Abstract

We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text-to-voice (T2V) by synthesizing timbre features directly from a text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
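
The abstract names Word Error Rate and Speaker Similarity as its objective voice cloning metrics but does not say how they are computed. The following is a minimal sketch under common assumptions, not the authors' evaluation code: WER is taken as the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words, and speaker similarity is taken as the cosine similarity between two speaker embedding vectors. The function names and the plain-list embeddings are hypothetical, and the speech recognizer and speaker verification model that would produce the inputs are outside the sketch.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

In this framing, a lower WER suggests the synthesized speech is more intelligible to the recognizer, and a cosine similarity closer to 1 suggests the cloned voice is closer to the reference speaker.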
