Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text

Abstract

Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. Comprehension clearly degrades when foreign tokens disrupt English text, even under linguistic constraints, whereas embedding English into other languages often improves it. Prompting yields mixed results, while fine-tuning offers a more stable way to mitigate the degradation.
