

PRiSM: Benchmarking Phone Realization in Speech Models

January 20, 2026
Authors: Shikhar Bharadwaj, Chin-Jou Li, Yoonjae Kim, Kwanghee Choi, Eunjung Yeo, Ryan Soh-Eun Shim, Hanyu Zhou, Brendon Boldt, Karen Rosero Jacome, Kalvin Chang, Darsh Agrawal, Keer Xu, Chao-Han Huck Yang, Jian Zhu, Shinji Watanabe, David R. Mortensen
cs.AI

Abstract

Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability: https://github.com/changelinglab/prism.
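The transcription-based evaluation that PRiSM standardizes is typically scored with phone error rate (PER): the edit distance between predicted and reference phone sequences, normalized by reference length. As a minimal illustrative sketch (not the PRiSM codebase's actual API; function names and phone strings here are hypothetical):

```python
def levenshtein(ref, hyp):
    """Edit distance between two phone sequences (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j]: distance between ref[:i] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                           # deletion
                dp[j - 1] + 1,                       # insertion
                prev + (ref[i - 1] != hyp[j - 1]),   # substitution (0 if match)
            )
            prev = cur
    return dp[n]

def phone_error_rate(ref, hyp):
    """PER = edit distance / number of reference phones."""
    return levenshtein(ref, hyp) / len(ref)

# One substitution (ɪ → i) and one insertion (ə) against 5 reference phones:
ref = ["p", "ɹ", "ɪ", "z", "m"]
hyp = ["p", "ɹ", "i", "z", "m", "ə"]
print(phone_error_rate(ref, hyp))  # → 0.4
```

Unlike word error rate, PER is computed over language-agnostic phone inventories (e.g. IPA symbols), which is what makes it usable for the cross-lingual comparisons the benchmark targets.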