An Empirical Recipe for Universal Phone Recognition

Abstract

Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing, yet robust performance remains elusive. High-performing English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective each contribute to multilingual PR. We present PhoneticXEUS, a model trained on large-scale multilingual data that achieves state-of-the-art performance on both multilingual speech (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations evaluated across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of self-supervised learning (SSL) representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.
