

Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

October 8, 2025
Authors: Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh
cs.AI

Abstract

Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, and limited CSW datasets and evaluation biases hinder their deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.