Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Abstract

Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
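The branch selection described above can be sketched as a small router. This is a minimal illustration under stated assumptions, not the released router: the branch names, the `Route` type, and both detection heuristics (majority Unicode block for language, Latin letters alongside Indic script for code-mix) are hypothetical. Only the Config B values and the mapping of languages to branches come from the abstract. Applying Config B on the Telugu/Tamil branch is inferred from the recipe description, and no sampling settings are given for the IndicF5 branch, so none are assumed.

```python
from dataclasses import dataclass
from typing import Optional

# Sampling overrides quoted in the abstract as "Config B".
CONFIG_B = {"exaggeration": 0.7, "temperature": 0.6, "min_p": 0.1}

# Unicode blocks of the three evaluated scripts.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
}


@dataclass(frozen=True)
class Route:
    branch: str                # hypothetical branch label
    language: str              # "hi", "ta", or "te"
    sampling: Optional[dict]   # None where the abstract states no settings


def count_scripts(text: str) -> dict:
    """Count characters falling in each supported Indic Unicode block."""
    counts = {lang: 0 for lang in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for lang, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[lang] += 1
    return counts


def detect_language(text: str) -> str:
    """Assumed heuristic: pick the script with the most characters."""
    counts = count_scripts(text)
    lang = max(counts, key=counts.get)
    if counts[lang] == 0:
        raise ValueError("no Devanagari, Tamil, or Telugu characters found")
    return lang


def is_code_mixed(text: str) -> bool:
    """Assumed heuristic: Latin letters appear alongside Indic script."""
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    has_indic = sum(count_scripts(text).values()) > 0
    return has_latin and has_indic


def route(text: str) -> Route:
    lang = detect_language(text)
    if is_code_mixed(text):
        # Third branch: IndicF5 + native-script transliteration.
        return Route("indicf5_native_translit", lang, None)
    if lang == "hi":
        # The LoRA regresses Hindi accuracy, so Hindi uses the vanilla base.
        return Route("chatterbox_vanilla", lang, CONFIG_B)
    # Telugu and Tamil: BUPS romanisation + R6 LoRA on t3.
    return Route("chatterbox_r6_lora_bups", lang, CONFIG_B)
```

For example, `route("నమస్కారం")` selects the LoRA branch with Config B, while a Hindi sentence containing an English word such as `"मुझे meeting में जाना है"` is sent to the code-mix branch.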
