POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
October 28, 2025
Authors: Chin-Jou Li, Kalvin Chang, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, Jian Zhu, David Mortensen, Shinji Watanabe
cs.AI
Abstract
Recent advances in spoken language processing have led to substantial
progress in phonetic tasks such as automatic speech recognition (ASR), phone
recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme
conversion (P2G). Despite their conceptual similarity, these tasks have largely
been studied in isolation, each relying on task-specific architectures and
datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech
Model), the first unified framework capable of jointly performing multiple
phone-related tasks. POWSM enables seamless conversion between audio, text
(graphemes), and phones, opening up new possibilities for universal and
low-resource speech processing. Our model outperforms or matches specialized PR
models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P,
P2G, and ASR. Our training data, code and models are released to foster open
science.
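To make the "seamless conversion between audio, text (graphemes), and phones" concrete, a Whisper-style multitask model typically selects the task at decode time via a special task token. The sketch below illustrates that dispatch pattern only; the token names, the toy lexicon, and the `run` function are hypothetical stand-ins, not POWSM's actual API or vocabulary.

```python
# Illustrative sketch of a Whisper-style multitask interface: one entry point
# routes between tasks via special task tokens. All names here are assumed
# for illustration, not taken from the POWSM release.

TASKS = {"<asr>", "<pr>", "<g2p>", "<p2g>"}

# Toy grapheme<->phone lexicon standing in for learned model behavior.
LEXICON = {"cat": "k æ t", "dog": "d ɔ ɡ"}
REVERSE = {phones: word for word, phones in LEXICON.items()}

def run(task: str, source: str) -> str:
    """Dispatch one input string through the requested task token."""
    if task not in TASKS:
        raise ValueError(f"unknown task token: {task}")
    if task == "<g2p>":   # graphemes -> phones
        return LEXICON[source]
    if task == "<p2g>":   # phones -> graphemes
        return REVERSE[source]
    # <asr>/<pr> would decode audio features; stubbed out in this sketch.
    raise NotImplementedError("audio tasks require a real acoustic model")
```

For example, `run("<g2p>", "cat")` returns the phone string `"k æ t"`, and `run("<p2g>", "k æ t")` maps it back to `"cat"`. The design point is that task selection is part of the decoder's input sequence rather than a separate model per task.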