Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
June 16, 2025
Authors: Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang
cs.AI
Abstract
Code-switching (CSW) is the act of alternating between two or more languages
within a single discourse. This phenomenon is widespread in multilingual
communities, and increasingly prevalent in online content, where users
naturally mix languages in everyday communication. As a result, Large Language
Models (LLMs), now central to content processing and generation, are frequently
exposed to code-switched inputs. Given their widespread use, it is crucial to
understand how LLMs process and reason about such mixed-language text. This
paper presents a systematic evaluation of LLM comprehension under
code-switching by generating CSW variants of established reasoning and
comprehension benchmarks. While degradation is evident when foreign tokens
disrupt English text, even under linguistic constraints, embedding English
into other languages often improves comprehension. Though prompting yields
mixed results, fine-tuning offers a more stable path to mitigating
degradation.
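
The abstract mentions generating CSW variants of established benchmarks. As a rough illustration of what such generation can look like (this is not the paper's actual pipeline; the lexicon, the substitution ratio, and the `code_switch` helper below are all hypothetical), a variant can be produced by replacing a fraction of English tokens with translations drawn from a bilingual lexicon:

```python
import random

# Minimal sketch: produce a code-switched variant of an English sentence by
# substituting a fraction of lexicon-covered tokens with their translations.
# The lexicon and ratio here are illustrative placeholders, not values from
# the paper.

# Hypothetical toy English -> French lexicon for demonstration.
LEXICON = {
    "languages": "langues",
    "communication": "communication",
    "users": "utilisateurs",
    "mix": "mélangent",
}

def code_switch(text: str, lexicon: dict[str, str], ratio: float = 0.5,
                seed: int = 0) -> str:
    """Replace roughly `ratio` of lexicon-covered tokens with translations."""
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        bare = tok.strip(".,;:!?").lower()
        if bare in lexicon and rng.random() < ratio:
            # Preserve trailing punctuation when substituting the token.
            suffix = tok[len(bare):] if tok.lower().startswith(bare) else ""
            out.append(lexicon[bare] + suffix)
        else:
            out.append(tok)
    return " ".join(out)

if __name__ == "__main__":
    sentence = "users naturally mix languages in everyday communication."
    print(code_switch(sentence, LEXICON, ratio=0.8))
```

Fixing the random seed keeps the perturbation reproducible, so the same code-switching pattern can be applied consistently across every item of a benchmark when comparing model comprehension before and after mixing.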