An Empirical Recipe for Universal Phone Recognition

Abstract

Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. High-performing English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective each contribute to multilingual PR. We present PhoneticXEUS, a model trained on large-scale multilingual data that achieves state-of-the-art performance on both multilingual speech (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations, evaluated across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.

