Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models
October 8, 2025
Authors: Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh
cs.AI
Abstract
Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, and limited CSW datasets and evaluation biases hinder their deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing unique_references studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and which challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.