Mega-ASR: In-the-wild² Speech Recognition via Scaling Up Real-World Acoustic Simulation

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": under severe, compositional distortions, models often lose acoustic grounding and produce omissions or hallucinations. We propose Mega-ASR, a unified in-the-wild ASR framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, which covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments show that Mega-ASR substantially outperforms prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust in-the-wild ASR.
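
To make the reported quantities concrete, the following is a minimal sketch of how word error rate (WER) and relative WER reduction are conventionally computed. It is not the authors' evaluation code: the function names `wer` and `relative_reduction` are hypothetical, text normalization is reduced to whitespace splitting, and the final lines assume that the benchmark percentages quoted in the abstract are WERs (lower is better), which the abstract itself does not state explicitly.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion (an omitted word)
                curr[j - 1] + 1,         # insertion (e.g. a hallucinated word)
                prev[j - 1] + (r != h),  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers the error rate of `baseline`."""
    return (baseline - system) / baseline


# An omission and a hallucination both raise WER.
omission = wer("turn the lights off", "turn lights off")
hallucination = wer("turn the lights off", "turn the lights off right now")

# ASSUMPTION: the abstract's benchmark percentages are WERs.
voices_rel = relative_reduction(54.01, 45.69)
noizeus_rel = relative_reduction(29.34, 21.49)
```

Under that assumption, the two benchmark pairs would correspond to relative reductions of roughly 15% (VOiCES R4-B-F) and 27% (NOIZEUS Sta-0), separate from the over-30% figure the abstract reports for compositional scenarios.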