Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models
October 8, 2025
Authors: Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh
cs.AI
Abstract
Code-switching (CSW), the alternation of languages and scripts within a
single utterance, remains a fundamental challenge for multilingual NLP, even
amidst the rapid advances of large language models (LLMs). Most LLMs still
struggle with mixed-language inputs, limited CSW datasets, and evaluation
biases, hindering deployment in multilingual societies. This survey provides
the first comprehensive analysis of CSW-aware LLM research, reviewing
unique_references studies spanning five research areas, 12 NLP tasks,
30+ datasets, and 80+ languages. We classify recent advances by architecture,
training strategy, and evaluation methodology, outlining how LLMs have reshaped
CSW modeling and what challenges persist. The paper concludes with a roadmap
emphasizing the need for inclusive datasets, fair evaluation, and
linguistically grounded models to achieve truly multilingual intelligence. A
curated collection of all resources is maintained at
https://github.com/lingo-iitgn/awesome-code-mixing/.