

In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties

May 20, 2025
Authors: Nathan Roll, Calbert Graham, Yuka Tatsumi, Kim Tien Nguyen, Meghan Sumner, Dan Jurafsky
cs.AI

Abstract

Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided, though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.
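
To make the setup concrete, the sketch below shows how the abstract's interleaved prompting scheme (a task instruction, then audio-transcript example pairs, then the target audio) might be assembled against the public Hugging Face release of Phi-4 Multimodal. It is a minimal illustration under stated assumptions, not the authors' released code: the <|user|>, <|assistant|>, <|end|> chat markers and <|audio_k|> placeholders follow the microsoft/Phi-4-multimodal-instruct model card, while the file paths, example transcripts, and instruction wording here are hypothetical (the paper's actual prompts are in its GitHub release).

```python
# Minimal sketch of interleaved audio-text in-context learning for ASR.
# Assumes the Hugging Face microsoft/Phi-4-multimodal-instruct interface:
# chat markers <|user|>, <|assistant|>, <|end|> and audio placeholders
# <|audio_1|>, <|audio_2|>, ... as documented on the model card.
# File paths, transcripts, and the instruction wording are illustrative,
# not the paper's released prompts (those are on the authors' GitHub).
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
)
generation_config = GenerationConfig.from_pretrained(MODEL_ID)

# Up to 12 in-context (audio, transcript) example pairs (~50 s total).
examples = [
    ("examples/utt01.wav", "she had your dark suit in greasy wash water"),
    ("examples/utt02.wav", "don't ask me to carry an oily rag like that"),
    # ... more (audio path, transcript) pairs ...
]
target_wav = "target/utt13.wav"

# Interleave the task prompt, example audio placeholders, and transcripts,
# then end with the target audio and an open transcription request.
parts = ["<|user|>Transcribe each audio clip exactly.\n"]
audios = []
for i, (path, text) in enumerate(examples, start=1):
    parts.append(f"<|audio_{i}|>\nTranscript: {text}\n")
    audios.append(sf.read(path))  # sf.read returns a (waveform, sample_rate) tuple
parts.append(f"<|audio_{len(examples) + 1}|>\nTranscript:")
audios.append(sf.read(target_wav))
prompt = "".join(parts) + "<|end|><|assistant|>"

inputs = processor(text=prompt, audios=audios, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, generation_config=generation_config)
new_tokens = out[:, inputs["input_ids"].shape[1]:]  # keep only generated tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

Per the abstract, accuracy should improve most when the in-context clips come from the target speaker and when more pairs are added, with diminishing returns as the context grows; the reported figures (a 1.2 pp. drop equal to a 19.7% relative reduction) imply average baseline word error rates of roughly 6%.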
