Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost
April 28, 2026
Author: Venkata Pushpak Teja Menta
cs.AI
Abstract
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
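The abstract's deployment logic (two Chatterbox branches plus an IndicF5 code-mix branch, with the "Config B" sampling overrides) can be sketched as follows. This is a minimal illustrative sketch, not the released router API; all function and field names are hypothetical, while the numeric values come from the abstract.

```python
# Hypothetical sketch of the three-branch routing described in the abstract.
# Names are illustrative; only the Config B values and branch logic are
# taken from the paper's summary.

CONFIG_B = {"exaggeration": 0.7, "temperature": 0.6, "min_p": 0.1}

def route(lang: str, code_mixed: bool) -> dict:
    """Select a synthesis branch for a Hi/Te/Ta request."""
    if code_mixed:
        # Branch 3: IndicF5 with native-script transliteration,
        # used for intra-sentential code-mix.
        return {"model": "indicf5", "script": "native", "sampling": {}}
    if lang == "hi":
        # Branch 2: vanilla Chatterbox + Config B
        # (the LoRA regresses Hindi accuracy, so it is not applied).
        return {"model": "chatterbox", "lora": None, "sampling": CONFIG_B}
    # Branch 1 (Telugu/Tamil): Chatterbox + R6 LoRA on the t3 text-token
    # predictor, BUPS ISO-15919 romanised input, Hindi-proxy language_id,
    # and Config B voice-prompt recovery (8-11s same-language reference).
    return {"model": "chatterbox", "lora": "r6", "language_id": "hi",
            "script": "iso15919", "sampling": CONFIG_B}
```

The router keeps the frozen acoustic decoder untouched in every branch; only the input representation, optional LoRA, and sampling parameters change per request.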