MaskLID: Code-Switching Language Identification through Iterative Masking
June 10, 2024
Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze
cs.AI
Abstract
We present MaskLID, a simple, yet effective, code-switching (CS) language
identification (LID) method. MaskLID does not require any training and is
designed to complement current high-performance sentence-level LIDs.
Sentence-level LIDs are classifiers trained on monolingual texts to provide
single labels, typically using a softmax layer to turn scores into
probabilities. However, in cases where a sentence is composed in both L1 and L2
languages, the LID classifier often only returns the dominant label L1. To
address this limitation, MaskLID employs a strategy to mask text features
associated with L1, allowing the LID to classify the text as L2 in the next
round. This method uses the LID itself to identify the features that require
masking and does not rely on any external resource. In this work, we explore
the use of MaskLID for two open-source LIDs (GlotLID and OpenLID) that are
both based on the FastText architecture. Code and demo are available at
https://github.com/cisnlp/MaskLID.
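The classify, mask, reclassify loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the word-weight table `TOY_WEIGHTS`, the helpers `classify` and `mask_lid`, and the parameters `max_rounds` and `min_words` are all hypothetical stand-ins for a real FastText-based LID such as GlotLID or OpenLID. The sketch only shows the general idea of using the classifier's own feature scores to decide what to mask.

```python
import math

# Hypothetical per-word scores standing in for a trained sentence-level LID.
# A real FastText-based LID learns such feature weights from monolingual text.
TOY_WEIGHTS = {
    "en": {"i": 2.0, "want": 2.0, "to": 1.5, "eat": 2.0, "the": 2.0, "today": 2.0},
    "es": {"quiero": 2.0, "comer": 2.0, "paella": 1.5, "hoy": 2.0,
           "con": 1.5, "amigos": 2.0},
}


def classify(words):
    """Return (label, probability): softmax over summed word scores."""
    totals = {
        lang: sum(weights.get(w, 0.0) for w in words)
        for lang, weights in TOY_WEIGHTS.items()
    }
    norm = sum(math.exp(v) for v in totals.values())
    label = max(totals, key=totals.get)
    return label, math.exp(totals[label]) / norm


def mask_lid(text, max_rounds=2, min_words=2):
    """Toy iterative masking: find the dominant label, mask its words, repeat."""
    words = text.lower().split()
    found = []
    for _ in range(max_rounds):
        if len(words) < min_words:
            break  # too little text left to classify reliably
        label, _prob = classify(words)
        # Use the classifier's own weights to pick the features to mask.
        hits = {w for w in words if TOY_WEIGHTS[label].get(w, 0.0) > 0.0}
        if not hits:
            break  # nothing supports this label, so stop
        found.append(label)
        words = [w for w in words if w not in hits]
    return found


mixed = "I want to eat paella con amigos hoy"
single_label, _ = classify(mixed.lower().split())  # dominant label only
labels = mask_lid(mixed)                           # labels found across rounds
```

On the mixed sentence, a single call to `classify` reports only the dominant language, while `mask_lid` recovers a second label after the first language's words are masked. The stopping conditions here (a round limit and a minimum number of remaining words) are illustrative assumptions, not settings taken from the paper.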