Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

Abstract

Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, which hinders their deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and which challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.
