T5Gemma-TTS Technical Report

Abstract

Autoregressive neural codec language models show strong zero-shot voice cloning ability, but decoder-only architectures treat the input text as a prefix that competes with the growing audio sequence for positional capacity, which weakens text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer. Built on the pretrained T5Gemma encoder-decoder backbone (2B encoder + 2B decoder, 4B parameters in total), it inherits rich linguistic knowledge and processes text directly at the subword level, without phoneme conversion. To improve duration control, we introduce Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers, injecting normalized progress signals that help the decoder track the target speech length. Trained on 170,000 hours of multilingual speech in English, Chinese, and Japanese, T5Gemma-TTS achieves a statistically significant gain in Japanese speaker similarity over XTTSv2 (0.677 vs. 0.622; non-overlapping 95% confidence intervals) and the numerically highest Korean speaker similarity (0.747) even though Korean was not included in training, although the margin over XTTSv2 (0.741) is not statistically conclusive. It also attains the numerically lowest Japanese character error rate (CER) among five baselines (0.126), though this ranking should be interpreted cautiously because its confidence interval partially overlaps with Kokoro's. English results on LibriSpeech should be viewed as an upper-bound estimate because LibriHeavy is a superset of LibriSpeech. With the same checkpoint, disabling PM-RoPE at inference causes near-complete synthesis failure: CER degrades from 0.129 to 0.982 and duration accuracy drops from 79% to 46%. Code and weights are available at https://github.com/Aratako/T5Gemma-TTS.
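The significance statements above rest on whether two 95% confidence intervals overlap. As a minimal sketch of that kind of check, the following computes percentile bootstrap intervals for the mean of per-utterance scores and tests them for overlap. The abstract does not state how the intervals were computed, so the bootstrap procedure, the function names `bootstrap_ci` and `intervals_overlap`, and the synthetic scores are all assumptions made for illustration and are not the report's data or evaluation code.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of the scores.

    Hypothetical helper: the report's actual CI procedure is not given here.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """Return True if closed intervals a=(lo, hi) and b=(lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Synthetic per-utterance similarity scores (illustrative only, NOT the report's data).
rng = random.Random(42)
system_a = [rng.gauss(0.68, 0.08) for _ in range(500)]
system_b = [rng.gauss(0.62, 0.08) for _ in range(500)]

ci_a = bootstrap_ci(system_a)
ci_b = bootstrap_ci(system_b)
print("A:", ci_a, "B:", ci_b, "overlap:", intervals_overlap(ci_a, ci_b))
```

Non-overlap of two 95% intervals is a conservative criterion for a difference. Overlapping intervals, as in the Korean speaker-similarity and Japanese CER comparisons, do not by themselves show that the systems are equivalent, which is why those rankings are reported as numerical rather than conclusive.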
