The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

Abstract

Niche-domain Indic ASR (digit strings, currency amounts, addresses, brand names, English-Indic code-mixing) is under-served by both open-source state-of-the-art (SOTA) and commercial systems. On a synthesised, entity-dense Telugu test set held out by synthesis system, vasista22/whisper-telugu-large-v2 (open SOTA) achieves an Entity-Hit-Rate (EHR) of 0.027 and Deepgram Nova-3 (commercial) achieves 0.16. We close this gap with a self-contained TTS<->STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mixed utterances at under $50 marginal cost, and a LoRA fine-tune on top of vasista22 reaches EHR 0.473 on the held-out test set (17x over open SOTA, 3x over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language results: beta-Hi reaches 0.337 (7x vs vasista22) and beta-Ta reaches 0.543 (22x vs vasista22, 22x vs Deepgram); on Hindi, where Deepgram has substantial entity coverage, the flywheel underperforms the commercial system. All three beta models fall short of their pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta), and we report this shortfall as is. A sanity check on native-speaker recordings (n=20, Telugu) confirms transfer to real speech (beta-Te EHR 0.516 on native speech vs 0.473 on synthetic speech). An EDSA-isolation ablation (LoRA on FLEURS-Te alone) yields EHR 0.020 on the same held-out set, attributing ~100% of the gain to the EDSA corpus. We additionally report a language-conditional finding: vanilla Whisper-large-v3 exhibits Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated for Hindi and Tamil, where vanilla SFR >= 0.98. Code, holdout sets, predictions, the EDSA corpus, and entity dictionaries are released open-source.
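The headline ratios and the "~100% of the gain" attribution can be re-derived from the EHR values reported in the abstract. The sketch below does that arithmetic and also illustrates one plausible way to compute an entity hit rate. The abstract does not define EHR, so the function `entity_hit_rate`, its normalized-substring matching rule, and the example utterance are assumptions for illustration only, not the authors' metric implementation.

```python
import unicodedata


def normalize(text):
    """NFC-normalize, casefold, and collapse whitespace before matching."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def entity_hit_rate(reference_entities, hypotheses):
    """ASSUMED definition: fraction of reference entities that appear
    verbatim (after normalization) in the corresponding ASR hypothesis.
    The source abstract does not specify the exact matching rule."""
    hits = 0
    total = 0
    for entities, hyp in zip(reference_entities, hypotheses):
        hyp_norm = normalize(hyp)
        for entity in entities:
            total += 1
            if normalize(entity) in hyp_norm:
                hits += 1
    return hits / total if total else 0.0


# EHR values as reported in the abstract (Telugu held-out set).
EHR_OPEN_SOTA = 0.027    # vasista22/whisper-telugu-large-v2
EHR_COMMERCIAL = 0.16    # Deepgram Nova-3
EHR_FLYWHEEL = 0.473     # LoRA fine-tune on the EDSA corpus
EHR_FLEURS_ONLY = 0.020  # ablation: LoRA on FLEURS-Te alone

# Improvement factors quoted as "17x" and "3x".
ratio_vs_open = EHR_FLYWHEEL / EHR_OPEN_SOTA
ratio_vs_commercial = EHR_FLYWHEEL / EHR_COMMERCIAL

# Share of the total gain that disappears when the EDSA corpus is removed.
total_gain = EHR_FLYWHEEL - EHR_OPEN_SOTA
gain_lost_without_edsa = EHR_FLYWHEEL - EHR_FLEURS_ONLY
edsa_share = gain_lost_without_edsa / total_gain

# Hypothetical example utterance, not taken from the released holdout set.
example_ehr = entity_hit_rate(
    [["500", "Amazon"]],
    ["pay 500 to flipkart"],
)

print(f"vs open SOTA:  {ratio_vs_open:.1f}x")
print(f"vs commercial: {ratio_vs_commercial:.1f}x")
print(f"EDSA share of gain: {edsa_share:.1%}")
print(f"example EHR: {example_ehr}")
```

Because the FLEURS-only ablation scores slightly below the unadapted baseline (0.020 vs 0.027), the computed share comes out marginally above 100%, which is consistent with the abstract's "~100%".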
