Mega-ASR: In-the-Wild-Squared Speech Recognition via Scaling Real-World Acoustic Simulation

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": under severe, compositional distortions, models often lose acoustic grounding and produce omissions or hallucinations. We propose Mega-ASR, a unified in-the-wild ASR framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce the Voices-in-the-Wild-2M dataset, which covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and we train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning followed by Dual-Granularity WER-Gated Policy Optimization. Extensive experiments show that Mega-ASR outperforms prior state-of-the-art systems on adverse-condition ASR benchmarks (WER of 45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers a relative WER reduction of over 30% against strong open- and closed-source baselines, establishing a scalable paradigm for robust in-the-wild ASR.
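
To make the quoted numbers easier to compare, the short sketch below converts the two benchmark pairs from the abstract into relative reductions. It assumes the reported percentages are word error rates (lower is better) and uses the conventional definition (baseline - system) / baseline. The function name `relative_wer_reduction` is illustrative and does not come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the fraction by which system_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# (prior state of the art, Mega-ASR) pairs quoted in the abstract.
# Assumption: both values are WER percentages.
reported = {
    "VOiCES R4-B-F": (54.01, 45.69),
    "NOIZEUS Sta-0": (29.34, 21.49),
}

reductions = {
    name: relative_wer_reduction(baseline, system)
    for name, (baseline, system) in reported.items()
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1%} relative reduction")
# VOiCES R4-B-F: 15.4% relative reduction
# NOIZEUS Sta-0: 26.8% relative reduction
```

Under that assumption, both benchmark gains fall below 30%. The "over 30%" figure therefore refers to the compositional scenarios, for which the abstract gives no absolute numbers.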