Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost
April 28, 2026
Author: Venkata Pushpak Teja Menta
cs.AI
Abstract
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
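To make the romanisation step concrete: the sketch below is an illustrative toy, not the released BUPS implementation. It maps a handful of Telugu characters to their standard ISO-15919 values (BUPS itself covers seven Brahmic scripts) and shows why the scheme is deterministic and Latin-tokeniser-friendly: consonants carry an inherent "a", which dependent vowel signs and the virama then rewrite.

```python
# Toy deterministic Telugu -> ISO-15919 romaniser (illustrative only; the
# real BUPS table covers seven Indic scripts and many more signs).
ISO_15919 = {
    "క": "ka",   # velar stop
    "ట": "ṭa",   # retroflex stop; the underdot preserves retroflexion
    "త": "ta",   # dental stop
    "మ": "ma",   # nasal
    "ి": "i",    # dependent vowel sign i (replaces the inherent 'a')
    "్": "",     # virama: suppresses the inherent vowel entirely
}

MODIFIERS = {"ి", "్"}  # signs that rewrite the previous consonant's 'a'

def romanise(text: str) -> str:
    out: list[str] = []
    for ch in text:
        mapped = ISO_15919.get(ch)
        if mapped is None:
            out.append(ch)  # pass unknown characters through unchanged
        elif ch in MODIFIERS and out and out[-1].endswith("a"):
            out[-1] = out[-1][:-1] + mapped  # strip inherent 'a', apply sign
        else:
            out.append(mapped)
    return "".join(out)
```

Because every character maps through a fixed table, the output is reproducible and consists only of Latin letters plus ISO-15919 diacritics, which a Latin-script tokeniser can segment; for example `romanise("ట")` yields `"ṭa"` and `romanise("మి")` yields `"mi"`.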
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.