Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
May 19, 2026
Authors: Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao
cs.AI
Abstract
Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in the wild.
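To make the reported figures easier to interpret, the following is a minimal sketch of how word error rate (WER) and relative WER reduction are commonly computed. It is not taken from the paper: the function names are hypothetical, whitespace tokenization is a simplification, and the example assumes that the paired percentages in the abstract are WERs with Mega-ASR listed first (the abstract does not label them explicitly).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Hypothetical helper: splits on whitespace and applies no text
    normalization, which real ASR evaluations usually add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction by which system_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

    # Figures quoted in the abstract, ASSUMED to be WER percentages
    # with the baseline second in each pair.
    print(f"VOiCES R4-B-F: {relative_reduction(54.01, 45.69):.1%}")  # about 15.4%
    print(f"NOIZEUS Sta-0: {relative_reduction(29.34, 21.49):.1%}")  # about 26.8%
```

Under that assumption, the two single-benchmark gaps correspond to roughly 15% and 27% relative reductions, which is a separate result from the "over 30%" figure the abstract reports for compound acoustic scenarios.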