MaskLID: Code-Switching Language Identification through Iterative Masking

Abstract

We present MaskLID, a simple yet effective method for code-switching (CS) language identification (LID). MaskLID requires no training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual text to provide a single label, typically using a softmax layer to turn scores into probabilities. However, when a sentence is composed in two languages, L1 and L2, the LID classifier often returns only the dominant label, L1. To address this limitation, MaskLID masks the text features associated with L1, allowing the LID to classify the text as L2 in the next round. The method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID with two open-source LIDs, GlotLID and OpenLID, both of which are based on the FastText architecture. The code and a demo are available at https://github.com/cisnlp/MaskLID.
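
The masking loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy word scores, the two-language label set, and the names `TOY_SCORES`, `predict`, `mask_lid`, `min_prob`, and `max_rounds` are all assumptions made for illustration. A real setup would take its scores from a trained FastText-based LID such as GlotLID or OpenLID.

```python
import math

# Hypothetical per-word scores (logits) for two languages. A real LID would
# derive these from its trained model; the numbers here are made up.
LANGS = ("eng", "deu")
TOY_SCORES = {
    "the": {"eng": 3.0, "deu": 0.0},
    "weather": {"eng": 3.0, "deu": 0.0},
    "is": {"eng": 3.0, "deu": 0.2},
    "nice": {"eng": 3.0, "deu": 0.1},
    "today": {"eng": 3.0, "deu": 0.0},
    "aber": {"eng": 0.0, "deu": 2.0},
    "ich": {"eng": 0.0, "deu": 2.0},
    "bleibe": {"eng": 0.0, "deu": 2.0},
    "zuhause": {"eng": 0.0, "deu": 2.0},
}


def word_scores(word):
    """Return the toy scores for a word; unknown words score zero everywhere."""
    return TOY_SCORES.get(word.lower(), {lang: 0.0 for lang in LANGS})


def softmax(scores):
    """Turn a dict of scores into a dict of probabilities."""
    top = max(scores.values())
    exps = {lang: math.exp(s - top) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


def predict(words):
    """Toy sentence-level LID: average word scores, softmax, return the top label."""
    mean = {lang: sum(word_scores(w)[lang] for w in words) / len(words)
            for lang in LANGS}
    probs = softmax(mean)
    label = max(probs, key=probs.get)
    return label, probs[label]


def supports(word, label):
    """True if the word scores strictly higher for `label` than for any other language."""
    scores = word_scores(word)
    return all(scores[label] > s for lang, s in scores.items() if lang != label)


def mask_lid(sentence, min_prob=0.6, max_rounds=3):
    """Iteratively predict a label, mask the words that support it, and repeat."""
    words = sentence.split()
    labels = []
    for _ in range(max_rounds):
        if not words:
            break  # everything has been masked
        label, prob = predict(words)
        if prob < min_prob or label in labels:
            break  # remaining text gives no confident new label
        labels.append(label)
        # Mask (drop) the words the LID itself attributes to the predicted label.
        words = [w for w in words if not supports(w, label)]
    return labels


print(mask_lid("the weather is nice today aber ich bleibe zuhause"))  # ['eng', 'deu']
```

In this sketch, the first round returns only the dominant label `eng`; once the words supporting `eng` are masked, the second round classifies the remainder as `deu`. The stopping rule (a probability threshold plus a cap on rounds) is one plausible choice for the toy example, not a description of the released code.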

