Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models
October 8, 2025
Authors: Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh
cs.AI
Abstract
Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, and limited CSW datasets and evaluation biases hinder their deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing unique_references studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and which challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.