The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

Abstract

Niche-domain Indic ASR (digit strings, currency amounts, addresses, brand names, and English/Indic code-mix) is under-served by both open-source SOTA and commercial systems. On a synthesised, entity-dense Telugu test set held out by synthesis system, vasista22/whisper-telugu-large-v2 (open SOTA) achieves an Entity-Hit-Rate (EHR) of 0.027 and Deepgram Nova-3 (commercial) 0.16. We close this gap with a self-contained TTS<->STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mix utterances at under $50 marginal cost, and a LoRA fine-tune on top of vasista22 reaches EHR 0.473 on the held-out test set (17x over open SOTA, 3x over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language results are beta-Hi 0.337 (7x vs vasista22) and beta-Ta 0.543 (22x vs vasista22, 22x vs Deepgram); on Hindi, where Deepgram has substantial entity coverage, the flywheel underperforms the commercial system. All three beta models fall below their pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta), and we report that shortfall as is. A native-human-recorded sanity check (n=20, Telugu) confirms transfer to real speech: beta-Te EHR is 0.516 on native recordings vs 0.473 on synthetic audio. An EDSA-isolation ablation (LoRA on FLEURS-Te alone) yields EHR 0.020 on the same held-out set, attributing ~100% of the gain to the EDSA corpus. We additionally report a language-conditional finding: vanilla Whisper-large-v3 exhibits a Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated for Hindi and Tamil, where vanilla SFR >= 0.98. Code, held-out sets, predictions, the EDSA corpus, and entity dictionaries are released as open source.
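
To make the headline metric concrete, the following is a minimal Python sketch of how an entity hit rate could be computed and how the Telugu multipliers follow from the reported scores. The abstract does not define EHR precisely, so the definition used here (the fraction of reference entities that appear verbatim, after case and whitespace normalisation, in the corresponding hypothesis) is an assumption, and `entity_hit_rate`, `gain`, and the toy strings are hypothetical names and data, not the released evaluation code.

```python
from typing import Sequence


def entity_hit_rate(references: Sequence[Sequence[str]],
                    hypotheses: Sequence[str]) -> float:
    """Fraction of reference entities found verbatim in the paired hypothesis.

    Assumed definition for illustration only: the exact matching rules used
    in the paper (normalisation, fuzzy matching) are not given in the abstract.
    """
    total = 0
    hits = 0
    for entities, hyp in zip(references, hypotheses):
        # Lowercase and collapse whitespace before substring matching.
        hyp_norm = " ".join(hyp.lower().split())
        for entity in entities:
            total += 1
            if " ".join(entity.lower().split()) in hyp_norm:
                hits += 1
    return hits / total if total else 0.0


def gain(ours: float, baseline: float) -> float:
    """Multiplicative improvement of one EHR score over another."""
    return ours / baseline


# Toy example (made-up strings): two of the three entities are recovered.
refs = [["500 rupees", "HDFC Bank"], ["Flipkart"]]
hyps = ["please pay 500 rupees to hdfc bank", "order from flip cart"]
ehr = entity_hit_rate(refs, hyps)

# Telugu scores reported in the abstract.
open_sota_gain = gain(0.473, 0.027)   # about 17.5, reported as 17x
commercial_gain = gain(0.473, 0.16)   # about 2.96, reported as 3x
```

Under this assumed definition a hypothesis that splits "Flipkart" into "flip cart" counts as a miss, which is the kind of entity error a word-level WER can understate.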

