T5Gemma-TTS Technical Report
April 2, 2026
Authors: Chihiro Arata, Kiyoshi Kurihara
cs.AI
Abstract
Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer. Built on the T5Gemma pretrained encoder-decoder backbone (2B encoder + 2B decoder; 4B parameters), it inherits rich linguistic knowledge without phoneme conversion and processes text directly at the subword level. To improve duration control, we introduce Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers, injecting normalized progress signals that help the decoder track target speech length. Trained on 170,000 hours of multilingual speech in English, Chinese, and Japanese, T5Gemma-TTS achieves a statistically significant speaker-similarity gain on Japanese over XTTSv2 (0.677 vs. 0.622; non-overlapping 95% confidence intervals) and the highest numerical Korean speaker similarity (0.747) despite Korean not being included in training, although this margin over XTTSv2 (0.741) is not statistically conclusive. It also attains the lowest numerical Japanese character error rate among five baselines (0.126), though this ranking should be interpreted cautiously because of partial confidence-interval overlap with Kokoro. English results on LibriSpeech should be viewed as an upper-bound estimate because LibriHeavy is a superset of LibriSpeech. Using the same checkpoint, disabling PM-RoPE at inference causes near-complete synthesis failure: CER degrades from 0.129 to 0.982 and duration accuracy drops from 79% to 46%. Code and weights are available at https://github.com/Aratako/T5Gemma-TTS.
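The abstract describes PM-RoPE as injecting a normalized progress signal into the cross-attention rotary position embedding so the decoder can track how far it is through the target utterance. The exact formulation is in the paper; the sketch below is a hypothetical NumPy illustration of the core idea, assuming PM-RoPE replaces raw decoder step indices with step/target_length (a value in [0, 1]) before applying standard RoPE rotation. Function names, the `max_pos` scale, and the rotation layout are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    # Standard RoPE: rotate pairs of feature dims by angles
    # position / base^(i / (d/2)) for each frequency index i.
    d = x.shape[-1]
    half = d // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = positions[:, None] * inv_freq[None, :]   # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def pm_rope(x, step_ids, target_len, max_pos=1000.0):
    # Hypothetical PM-RoPE: instead of rotating by the absolute decoder
    # step, rotate by normalized progress (step / target_len) scaled to a
    # fixed range, so positions encode "fraction of speech completed"
    # regardless of the utterance's absolute length.
    progress = step_ids / float(target_len)           # in [0, 1]
    return rope_rotate(x, progress * max_pos)
```

Under this framing, two utterances of different lengths share the same positional trajectory from 0 to `max_pos`, which is consistent with the abstract's claim that removing the progress signal at inference breaks duration tracking.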