Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Abstract

Automatic speech recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging because code-switched (CS) speech resources are severely scarce across diverse language pairs. Existing approaches improve CS-ASR performance mainly through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. These approaches face an inherent scalability limitation: CS support must be developed separately for each language pair, and the number of pairs grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models generalize only modestly to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.