Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

Abstract

Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval (IR) systems are still designed for, and evaluated within, predominantly monolingual contexts. To bridge this gap, we present a holistic study of code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), a dataset built through human annotation to capture the naturalness of authentic mixed-language queries. Our evaluation across statistical, dense, and late-interaction paradigms shows that code-switching is a fundamental performance bottleneck that degrades the effectiveness of even robust multilingual models. We show that this failure stems from a substantial divergence in the embedding space between pure (monolingual) and code-switched text. To scale this investigation, we propose CS-MTEB, a comprehensive benchmark covering 11 diverse tasks, on which we observe performance declines of up to 27%. Finally, we show that standard multilingual techniques such as vocabulary expansion are insufficient to fully resolve these deficits. These findings underscore the fragility of current systems and establish code-switching as a crucial frontier for future IR optimization.
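The abstract does not say how the embedding-space divergence is measured. As a purely illustrative sketch, one plausible proxy is the mean cosine distance between paired embeddings of a pure query and its code-switched counterpart. The function names and the toy vectors below are hypothetical and do not come from the paper; in practice the vectors would be produced by the retriever under study.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_divergence(pure_embeddings, cs_embeddings):
    """Average cosine distance (1 - cosine similarity) over paired embeddings.

    pure_embeddings[i] and cs_embeddings[i] are assumed to encode the same
    query in its monolingual and code-switched form, respectively.
    """
    if len(pure_embeddings) != len(cs_embeddings):
        raise ValueError("embedding lists must be paired one-to-one")
    distances = [
        1.0 - cosine_similarity(p, c)
        for p, c in zip(pure_embeddings, cs_embeddings)
    ]
    return sum(distances) / len(distances)

# Toy 2-dimensional vectors standing in for real model embeddings (assumption).
pure = [[1.0, 0.0], [0.0, 1.0]]
code_switched = [[1.0, 0.0], [1.0, 0.0]]
divergence = mean_divergence(pure, code_switched)
```

A value near 0 would mean the code-switched queries land close to their pure counterparts; larger values indicate the kind of drift the abstract describes.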
