Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
June 16, 2025
Authors: Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang
cs.AI
Abstract
Code-switching (CSW) is the act of alternating between two or more languages
within a single discourse. This phenomenon is widespread in multilingual
communities, and increasingly prevalent in online content, where users
naturally mix languages in everyday communication. As a result, Large Language
Models (LLMs), now central to content processing and generation, are frequently
exposed to code-switched inputs. Given their widespread use, it is crucial to
understand how LLMs process and reason about such mixed-language text. This
paper presents a systematic evaluation of LLM comprehension under
code-switching by generating CSW variants of established reasoning and
comprehension benchmarks. While degradation is evident when foreign tokens
disrupt English text, even under linguistic constraints, embedding English
into other languages often improves comprehension. Though prompting yields
mixed results, fine-tuning offers a more stable path to mitigating
degradation.
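
The abstract mentions generating CSW variants of established benchmarks. As a rough illustration of what such generation can look like (this is not the paper's actual pipeline; the lexicon, the substitution ratio, and the `code_switch` helper below are all hypothetical), a variant can be produced by replacing a fraction of English tokens with translations drawn from a bilingual lexicon:

```python
import random

# Minimal sketch: produce a code-switched variant of an English sentence by
# substituting a fraction of lexicon-covered tokens with their translations.
# The lexicon and ratio here are illustrative placeholders, not values from
# the paper.

# Hypothetical toy English -> French lexicon for demonstration.
LEXICON = {
    "languages": "langues",
    "communication": "communication",
    "users": "utilisateurs",
    "mix": "mélangent",
}

def code_switch(text: str, lexicon: dict[str, str], ratio: float = 0.5,
                seed: int = 0) -> str:
    """Replace roughly `ratio` of lexicon-covered tokens with translations."""
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        bare = tok.strip(".,;:!?").lower()
        if bare in lexicon and rng.random() < ratio:
            # Preserve trailing punctuation when substituting the token.
            suffix = tok[len(bare):] if tok.lower().startswith(bare) else ""
            out.append(lexicon[bare] + suffix)
        else:
            out.append(tok)
    return " ".join(out)

if __name__ == "__main__":
    sentence = "users naturally mix languages in everyday communication."
    print(code_switch(sentence, LEXICON, ratio=0.8))
```

Fixing the random seed keeps the perturbation reproducible, so the same code-switching pattern can be applied consistently across every item of a benchmark when comparing model comprehension before and after mixing.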