In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties
May 20, 2025
作者: Nathan Roll, Calbert Graham, Yuka Tatsumi, Kim Tien Nguyen, Meghan Sumner, Dan Jurafsky
cs.AI
Abstract
Human listeners readily adjust to unfamiliar speakers and language varieties
through exposure, but do these adaptation benefits extend to state-of-the-art
spoken language models? We introduce a scalable framework that allows for
in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts
and audio-text pairs, and find that as few as 12 example utterances (~50
seconds) at inference time reduce word error rates by a relative 19.7% (1.2
pp.) on average across diverse English corpora. These improvements are most
pronounced in low-resource varieties, when the context and target speaker
match, and when more examples are provided--though scaling our procedure yields
diminishing marginal returns to context length. Overall, we find that our novel
ICL adaptation scheme (1) reveals a similar performance profile to human
listeners, and (2) demonstrates consistent improvements to automatic speech
recognition (ASR) robustness across diverse speakers and language backgrounds.
While adaptation succeeds broadly, significant gaps remain for certain
varieties, revealing where current models still fall short of human
flexibility. We release our prompts and code on GitHub.
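To make the core mechanism concrete, the sketch below illustrates what "interleaved task prompts and audio-text pairs" could look like at inference time: each in-context example contributes an audio clip plus its reference transcript, and the target utterance is appended last with the transcript left for the model to generate. This is not the authors' released code; it assumes the Hugging Face chat format of microsoft/Phi-4-multimodal-instruct (the `<|user|>`, `<|assistant|>`, `<|end|>`, and `<|audio_N|>` tokens and the `audios=` processor argument from the model card), and all file paths and reference transcripts are hypothetical placeholders.

```python
# Minimal sketch (not the paper's released code): interleaved audio-text ICL
# prompting for ASR with Phi-4 Multimodal, following the Hugging Face model
# card conventions for microsoft/Phi-4-multimodal-instruct.
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)


def build_icl_prompt(pairs):
    """Interleave a task prompt with audio-text example pairs.

    `pairs` is a list of (wav_path, transcript). Entries with a transcript are
    in-context examples; the final entry uses transcript=None so the model
    generates the hypothesis for the target utterance.
    """
    instruction = "Transcribe the audio clip into text."
    audios, parts = [], []
    for i, (wav_path, transcript) in enumerate(pairs, start=1):
        audio, sr = sf.read(wav_path)
        audios.append((audio, sr))
        parts.append(f"<|user|><|audio_{i}|>{instruction}<|end|>")
        if transcript is not None:          # in-context example: give the answer
            parts.append(f"<|assistant|>{transcript}<|end|>")
        else:                               # target utterance: leave it open
            parts.append("<|assistant|>")
    return "".join(parts), audios


# Hypothetical data: 12 same-speaker examples (roughly 50 s of audio in the
# paper's setup) followed by the target clip to transcribe.
pairs = [(f"speaker_a/example_{i}.wav", f"reference transcript {i}") for i in range(12)]
pairs.append(("speaker_a/target.wav", None))

prompt, audios = build_icl_prompt(pairs)
inputs = processor(text=prompt, audios=audios, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
hypothesis = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(hypothesis)
```

Under this reading, the abstract's finding that gains grow with more examples but show diminishing returns corresponds simply to lengthening the list of (audio, transcript) pairs in the prompt; matching the example speaker to the target speaker is what yields the largest reported improvements.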