Mega-ASR: Real-World² Speech Recognition via Scaling Realistic Acoustic Simulation

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": under severe, compound distortions, models often lose acoustic grounding and produce omissions or hallucinations. We propose Mega-ASR, a unified in-the-wild ASR framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce the Voices-in-the-Wild-2M dataset, which covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments show that Mega-ASR outperforms prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). In complex compound acoustic scenarios, Mega-ASR further achieves a relative word error rate (WER) reduction of over 30% against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in the wild.
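
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate and a relative reduction between two systems are commonly computed. It is not the paper's evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, tokenization is plain whitespace splitting, and the sketch assumes the benchmark figures quoted in the abstract are WER percentages, which the abstract does not state explicitly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenization with no text normalization.
    Real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers an error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - system) / baseline


if __name__ == "__main__":
    # Two reference words are dropped from a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat on mat"))
    # Figures from the abstract, assumed here to be WER percentages.
    print(relative_reduction(54.01, 45.69))
    print(relative_reduction(29.34, 21.49))
```

Under that assumption, the two quoted benchmark pairs would correspond to relative reductions of roughly 15.4% (VOiCES R4-B-F) and 26.8% (NOIZEUS Sta-0). The "over 30%" figure in the abstract refers to the separate compound-scenario evaluation.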