Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

October 8, 2025
作者: Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh
cs.AI

Abstract

Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, hindering deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.