Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Abstract

Fine-tuning multilingual ASR models such as Whisper for low-resource languages often improves performance on read speech but degrades it on spontaneous audio, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam spanning four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that large parameter updates early in training reduce global WER by 12 absolute points, while a hard-to-easy curriculum yields additional gains on spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals that effective schedules concentrate adaptation in the decoder, preserving the acoustic geometry of the pre-trained encoder. We release the benchmark and the models.
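For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate (WER): the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. It is a minimal illustration under the assumption of whitespace tokenization with no text normalization; the function name `word_error_rate` is hypothetical and this is not the evaluation code used for the benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

An improvement in "absolute points" is a difference between two WER values expressed as percentages. With purely illustrative numbers (not results from this work), moving from a WER of 40% to 28% is a gain of 12 absolute points, which corresponds to a 30% relative reduction.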