Toward Truly Multilingual Speech Recognition: Generalizing Code-Switching Speech Recognition to Unseen Language Pairs

Abstract

Automatic speech recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging because multilingual code-switched (CS) speech resources are severely scarce across diverse language pairs. Existing approaches improve CS-ASR performance primarily through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. These approaches face an inherent scalability limitation: CS support must be developed separately for each language pair, and the number of pairs grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models generalize only modestly to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.