Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost
April 28, 2026
Author: Venkata Pushpak Teja Menta
cs.AI
Abstract
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
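The abstract's deployment logic (two Chatterbox branches plus an IndicF5 code-mix branch, with the "Config B" sampling overrides) can be sketched as follows. This is a minimal illustrative sketch, not the released router API; all function and field names are hypothetical, while the numeric values come from the abstract.

```python
# Hypothetical sketch of the three-branch routing described in the abstract.
# Names are illustrative; only the Config B values and branch logic are
# taken from the paper's summary.

CONFIG_B = {"exaggeration": 0.7, "temperature": 0.6, "min_p": 0.1}

def route(lang: str, code_mixed: bool) -> dict:
    """Select a synthesis branch for a Hi/Te/Ta request."""
    if code_mixed:
        # Branch 3: IndicF5 with native-script transliteration,
        # used for intra-sentential code-mix.
        return {"model": "indicf5", "script": "native", "sampling": {}}
    if lang == "hi":
        # Branch 2: vanilla Chatterbox + Config B
        # (the LoRA regresses Hindi accuracy, so it is not applied).
        return {"model": "chatterbox", "lora": None, "sampling": CONFIG_B}
    # Branch 1 (Telugu/Tamil): Chatterbox + R6 LoRA on the t3 text-token
    # predictor, BUPS ISO-15919 romanised input, Hindi-proxy language_id,
    # and Config B voice-prompt recovery (8-11s same-language reference).
    return {"model": "chatterbox", "lora": "r6", "language_id": "hi",
            "script": "iso15919", "sampling": CONFIG_B}
```

The router keeps the frozen acoustic decoder untouched in every branch; only the input representation, optional LoRA, and sampling parameters change per request.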