The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail
May 4, 2026
Author: Venkata Pushpak Teja Menta
cs.AI
Abstract
Niche-domain Indic ASR (digit strings, currency amounts, addresses, brand names, English/Indic code-mix) is under-served by both open-source SOTA and commercial systems. On a synthesised entity-dense Telugu test set (held out by synthesis system), vasista22/whisper-telugu-large-v2 (open SOTA) achieves an Entity-Hit-Rate (EHR) of 0.027 and Deepgram Nova-3 (commercial) 0.16. We close this gap with a self-contained TTS<->STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mix utterances at <$50 marginal cost, and a LoRA fine-tune on top of vasista22 achieves EHR 0.473 on the held-out test set (17x over open SOTA, 3x over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language: beta-Hi reaches EHR 0.337 (7x vs vasista22) and beta-Ta 0.543 (22x vs vasista22, 22x vs Deepgram); on Hindi, where Deepgram has substantial entity coverage, the flywheel underperforms the commercial system. All three beta models fall below their pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta); we report these misses as observed. A native-human-recorded sanity check (n=20 Telugu utterances) confirms transfer to real speech (beta-Te EHR 0.516 on native recordings vs 0.473 on synthetic audio). An EDSA-isolation ablation (LoRA on FLEURS-Te alone) yields EHR 0.020 on the same held-out set, attributing ~100% of the gain to the EDSA corpus. We additionally report a language-conditional finding: vanilla Whisper-large-v3 exhibits Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated for Hindi and Tamil, where vanilla SFR is already >= 0.98. Code, holdouts, predictions, the EDSA corpus, and entity dictionaries are released as open source.
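The headline Telugu multipliers and the "~100% of the gain" attribution can be checked from the EHR values quoted in the abstract. The sketch below is illustrative only: the dictionary keys and the helper names `ratio` and `gain_share` are hypothetical, and the attribution formula (gain lost in the ablation divided by total gain over the open baseline) is an assumption, since the abstract does not state how the figure was derived.

```python
# EHR values quoted in the abstract for the Telugu held-out set.
# Key names are illustrative labels, not identifiers from the released code.
EHR_TE = {
    "vasista22": 0.027,         # open SOTA baseline
    "deepgram_nova3": 0.16,     # commercial baseline
    "beta_te": 0.473,           # LoRA fine-tune on the EDSA corpus
    "fleurs_only_lora": 0.020,  # EDSA-isolation ablation (LoRA on FLEURS-Te alone)
}


def ratio(new: float, base: float) -> float:
    """Multiplicative improvement of `new` over `base`."""
    return new / base


def gain_share(full: float, ablated: float, baseline: float) -> float:
    """Share of the gain over `baseline` that vanishes in the ablation.

    Assumed formula: (full - ablated) / (full - baseline).
    """
    return (full - ablated) / (full - baseline)


if __name__ == "__main__":
    vs_open = ratio(EHR_TE["beta_te"], EHR_TE["vasista22"])
    vs_commercial = ratio(EHR_TE["beta_te"], EHR_TE["deepgram_nova3"])
    share = gain_share(
        EHR_TE["beta_te"], EHR_TE["fleurs_only_lora"], EHR_TE["vasista22"]
    )
    print(f"beta-Te vs open SOTA:  {vs_open:.1f}x")
    print(f"beta-Te vs commercial: {vs_commercial:.1f}x")
    print(f"gain attributed to EDSA corpus: {share:.0%}")
```

Under this assumed formula the share comes out slightly above 100% because the FLEURS-only ablation scores marginally below the untuned baseline, which is consistent with the "~100%" wording.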