Mega-ASR: Achieving In-the-wild² Speech Recognition by Scaling Up Real-World Acoustic Simulation

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": under severe, compositional distortions, models often lose acoustic grounding and produce omissions or hallucinations. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, which covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments show that Mega-ASR substantially outperforms prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR achieves over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust in-the-wild ASR.
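
For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate (WER) and a relative WER reduction are conventionally computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, the tokenization is a plain whitespace split, and treating the benchmark percentages quoted above as WER values is an assumption, since the abstract does not name the metric for those figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers an error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline


if __name__ == "__main__":
    # Two deleted words out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat on mat"))

    # ASSUMPTION: the percentages quoted in the abstract are WER values.
    print(relative_reduction(54.01, 45.69))  # VOiCES R4-B-F figures
    print(relative_reduction(29.34, 21.49))  # NOIZEUS Sta-0 figures
```

Under that assumption, the two benchmark pairs correspond to relative reductions of roughly 15% and 27%. The "over 30%" figure in the abstract refers to the separate compound-scenario evaluation.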