MaskLID: Code-Switching Language Identification through Iterative Masking

Abstract

We present MaskLID, a simple yet effective method for code-switching (CS) language identification (LID). MaskLID requires no training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, when a sentence is written in both L1 and L2, the LID classifier often returns only the dominant label, L1. To address this limitation, MaskLID masks the text features associated with L1, allowing the LID to classify the text as L2 in the next round. The method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID with two open-source LIDs, GlotLID and OpenLID, both based on the FastText architecture. Code and a demo are available at https://github.com/cisnlp/MaskLID.
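The masking loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and it does not call GlotLID, OpenLID, or FastText: the classifier here is a toy bag-of-words model whose weights (`TOY_WEIGHTS`) are invented for illustration, whole words stand in for the LID's text features, and the names `predict`, `mask_lid`, `max_rounds`, and `min_words` are hypothetical. The reference code is in the repository linked above.

```python
import math

# Hypothetical per-word scores for a toy two-language LID. These numbers are
# invented for illustration and do not come from GlotLID or OpenLID.
TOY_WEIGHTS = {
    "eng": {"i": 2.0, "want": 2.0, "to": 1.5, "eat": 2.0, "the": 1.5},
    "spa": {"pero": 2.0, "no": 0.5, "tengo": 2.0, "hambre": 2.0},
}


def word_scores(word):
    """Return the score each language assigns to one word (0.0 if unseen)."""
    return {lang: weights.get(word, 0.0) for lang, weights in TOY_WEIGHTS.items()}


def predict(words):
    """Toy sentence-level LID: sum word scores per language, then softmax."""
    totals = {lang: 0.0 for lang in TOY_WEIGHTS}
    for word in words:
        for lang, score in word_scores(word).items():
            totals[lang] += score
    norm = sum(math.exp(v) for v in totals.values())
    probs = {lang: math.exp(v) / norm for lang, v in totals.items()}
    top = max(probs, key=probs.get)
    return top, probs[top]


def mask_lid(text, max_rounds=2, min_words=2):
    """Label the text, mask the words that support that label, and repeat."""
    words = text.lower().split()
    found = []
    for _ in range(max_rounds):
        if len(words) < min_words:
            break  # too little text left to classify
        label, _prob = predict(words)
        # Use the classifier itself to find the words that support the label:
        # those for which the label is the top-scoring language.
        supporting = set()
        for word in words:
            scores = word_scores(word)
            if scores[label] > 0 and max(scores, key=scores.get) == label:
                supporting.add(word)
        if not supporting:
            break  # nothing in the remaining text supports any new label
        found.append(label)
        # Mask (here: drop) the supporting words before the next round.
        words = [word for word in words if word not in supporting]
    return found


print(mask_lid("I want to eat pero no tengo hambre"))  # ['eng', 'spa']
print(mask_lid("I want to eat the"))                   # ['eng']
```

In the first call the toy classifier initially returns only the dominant label `eng`; once the English-supporting words are masked, the second round returns `spa`. In the monolingual call nothing remains after masking, so only one label is reported.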
