Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
June 16, 2025
Authors: Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang
cs.AI
Abstract
Code-switching (CSW) is the act of alternating between two or more languages
within a single discourse. This phenomenon is widespread in multilingual
communities, and increasingly prevalent in online content, where users
naturally mix languages in everyday communication. As a result, Large Language
Models (LLMs), now central to content processing and generation, are frequently
exposed to code-switched inputs. Given their widespread use, it is crucial to
understand how LLMs process and reason about such mixed-language text. This
paper presents a systematic evaluation of LLM comprehension under
code-switching by generating CSW variants of established reasoning and
comprehension benchmarks. While degradation is evident when foreign tokens
disrupt English text, even under linguistic
constraints, embedding English into other languages often
improves comprehension. Though prompting yields mixed results, fine-tuning
offers a more stable path to degradation mitigation.
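To make "generating CSW variants of established benchmarks" concrete, below is a minimal Python sketch of one plausible approach: swapping a fraction of tokens in an English benchmark question for foreign-language counterparts drawn from a bilingual lexicon. The lexicon, the code_switch function, and the substitution ratio are illustrative assumptions for this sketch, not the paper's actual generation pipeline.

import random

# Toy English->French lexicon (an assumption for this sketch; a real
# pipeline would use an MT system or a full bilingual dictionary).
LEXICON = {
    "answer": "réponse",
    "text": "texte",
    "question": "question",
    "languages": "langues",
    "model": "modèle",
}

def code_switch(sentence: str, ratio: float = 0.3, seed: int = 0) -> str:
    """Return a code-switched variant of `sentence` by replacing roughly
    `ratio` of the translatable tokens with foreign-language equivalents."""
    rng = random.Random(seed)
    tokens = sentence.split()
    # Indices of tokens we can translate (case-insensitive lookup).
    candidates = [i for i, t in enumerate(tokens) if t.lower() in LEXICON]
    if not candidates:
        return sentence
    k = min(len(candidates), max(1, round(ratio * len(candidates))))
    for i in rng.sample(candidates, k):
        tokens[i] = LEXICON[tokens[i].lower()]
    return " ".join(tokens)

if __name__ == "__main__":
    q = "Which answer best matches the text about languages ?"
    print(code_switch(q))  # e.g. "Which réponse best matches the text about languages ?"

Applying such a transform to every question in a reasoning benchmark yields a code-switched evaluation set whose scores can be compared directly against the unmodified English baseline.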