Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text

Abstract

Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. While degradation is evident when foreign tokens disrupt English text, even under linguistic constraints, embedding English into other languages often improves comprehension. Though prompting yields mixed results, fine-tuning offers a more stable path to degradation mitigation.

Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
