The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

Abstract

Niche-domain Indic ASR (digit strings, currency amounts, addresses, brand names, and English/Indic code-mixed speech) is under-served by both open-source SOTA and commercial systems. On a synthesised entity-dense Telugu test set (held out by synthesis system), vasista22/whisper-telugu-large-v2 (open SOTA) achieves an Entity-Hit-Rate (EHR) of 0.027 and Deepgram Nova-3 (commercial) achieves 0.16. We narrow this gap with a self-contained TTS<->STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mixed utterances at under $50 marginal cost, and a LoRA fine-tune on top of vasista22 reaches EHR 0.473 on the held-out test set (17x over open SOTA, 3x over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language results: beta-Hi reaches 0.337 (7x vs vasista22) and beta-Ta reaches 0.543 (22x vs vasista22, 22x vs Deepgram); on Hindi, where Deepgram has substantial entity coverage, the flywheel underperforms the commercial system. All three beta models fall below their pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta), and we report this shortfall plainly. A native-speaker-recorded sanity check (n=20, Telugu) confirms transfer to real speech (beta-Te EHR 0.516 on native recordings vs 0.473 on synthetic audio). An EDSA-isolation ablation (LoRA trained on FLEURS-Te alone) yields EHR 0.020 on the same held-out set, attributing essentially all of the gain to the EDSA corpus. We additionally report a language-conditional finding: vanilla Whisper-large-v3 exhibits Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated for Hindi and Tamil, where vanilla SFR >= 0.98. Code, held-out sets, predictions, the EDSA corpus, and entity dictionaries are released as open source.
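
To make the headline figures concrete, the sketch below shows one plausible way an entity-hit-rate style metric could be computed and checks the improvement factors quoted above. The abstract does not define EHR precisely, so the matching rule here (normalised substring match of each reference entity in the hypothesis) is an assumption, and the names `entity_hit_rate` and `improvement_factor` are hypothetical helpers, not functions from the released code.

```python
from typing import Iterable, Sequence


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace so matching ignores spacing and case."""
    return " ".join(text.lower().split())


def entity_hit_rate(
    references: Sequence[Iterable[str]],
    hypotheses: Sequence[str],
) -> float:
    """Fraction of reference entities found in the matching ASR hypothesis.

    ASSUMPTION: an entity counts as a hit when its normalised form is a
    substring of the normalised hypothesis. The paper's EHR may differ.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must align one-to-one")
    hits = 0
    total = 0
    for entities, hypothesis in zip(references, hypotheses):
        hyp_norm = _normalise(hypothesis)
        for entity in entities:
            total += 1
            if _normalise(entity) in hyp_norm:
                hits += 1
    return hits / total if total else 0.0


def improvement_factor(new: float, baseline: float) -> float:
    """Ratio of a new score to a baseline score."""
    return new / baseline


# EHR figures quoted in the abstract for the Telugu held-out set.
EHR_BETA_TE = 0.473
EHR_VASISTA22 = 0.027
EHR_DEEPGRAM = 0.16

if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    refs = [["500 rupees", "HDFC"], ["9876543210"]]
    hyps = ["please pay 500 rupees to hdfc bank", "call 98765 43210"]
    print(f"toy EHR: {entity_hit_rate(refs, hyps):.3f}")
    print(f"vs open SOTA: {improvement_factor(EHR_BETA_TE, EHR_VASISTA22):.1f}x")
    print(f"vs commercial: {improvement_factor(EHR_BETA_TE, EHR_DEEPGRAM):.1f}x")
```

In the toy data the digit string is split by a space in the hypothesis, so under this strict matching rule it counts as a miss. Whether the actual metric normalises digit spacing is not stated in the abstract.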
