MaskLID: Code-Switching Language Identification through Iterative Masking
June 10, 2024
Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze
cs.AI
Abstract
We present MaskLID, a simple yet effective code-switching (CS) language
identification (LID) method. MaskLID does not require any training and is
designed to complement current high-performance sentence-level LIDs.
Sentence-level LIDs are classifiers trained on monolingual texts to provide
single labels, typically using a softmax layer to turn scores into
probabilities. However, when a sentence mixes two languages, L1 and L2,
the LID classifier often returns only the dominant label, L1. To
address this limitation, MaskLID employs a strategy to mask text features
associated with L1, allowing the LID to classify the text as L2 in the next
round. This method uses the LID itself to identify the features that require
masking and does not rely on any external resource. In this work, we explore
the use of MaskLID with two open-source LIDs, GlotLID and OpenLID, both of
which are based on the FastText architecture. The code and a demo are available at
https://github.com/cisnlp/MaskLID.
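
The predict, mask, and re-predict loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use FastText: the word-weight table `WEIGHTS`, the helper names, and the stopping rules are all hypothetical stand-ins for a real sentence-level LID and its learned features.

```python
import math

# Hypothetical per-word scores standing in for a trained LID's feature weights.
WEIGHTS = {
    "en": {"i": 2.0, "want": 2.0, "to": 1.5, "go": 1.5},
    "es": {"la": 2.0, "playa": 2.5},
}


def predict(words):
    """Toy sentence-level LID: sum word scores per language, then softmax."""
    totals = {lang: sum(table.get(w, 0.0) for w in words)
              for lang, table in WEIGHTS.items()}
    top = max(totals.values())
    exps = {lang: math.exp(score - top) for lang, score in totals.items()}
    norm = sum(exps.values())
    probs = {lang: e / norm for lang, e in exps.items()}
    return max(probs, key=probs.get), probs


def owned_by(word, label):
    """True if `label` is the language with the highest positive score for `word`."""
    scores = {lang: table.get(word, 0.0) for lang, table in WEIGHTS.items()}
    return scores[label] > 0 and max(scores, key=scores.get) == label


def mask_lid(text, max_rounds=3):
    """Iteratively predict the dominant label, mask its words, and predict again."""
    words = text.lower().split()
    found = []
    for _ in range(max_rounds):
        if not words:
            break  # nothing left to classify
        label, _ = predict(words)
        masked = {w for w in words if owned_by(w, label)}
        if label in found or not masked:
            break  # no new language, or no evidence to mask
        found.append(label)
        # Mask (here: drop) the features tied to the language just found.
        words = [w for w in words if w not in masked]
    return found


print(mask_lid("I want to go to la playa"))
```

The classifier itself decides which words to mask, so no external resource is needed, mirroring the idea in the abstract. A real system would likely add confidence thresholds before accepting a second label; those are omitted here.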