MaskLID: Code-Switching Language Identification through Iterative Masking

Abstract

We present MaskLID, a simple yet effective method for code-switching (CS) language identification (LID). MaskLID requires no training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide a single label, typically using a softmax layer to turn scores into probabilities. However, when a sentence mixes two languages, L1 and L2, the LID classifier often returns only the dominant label, L1. To address this limitation, MaskLID masks the text features associated with L1, allowing the LID to classify the text as L2 in the next round. The method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID with two open-source LIDs, GlotLID and OpenLID, both of which are based on the FastText architecture. Code and a demo are available at https://github.com/cisnlp/MaskLID.
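The iterative masking loop described in the abstract can be illustrated with a minimal Python sketch. This is not the MaskLID implementation: the toy classifier, the word-level score table `TOY_WEIGHTS`, the function names, and the stopping parameters `max_rounds` and `min_words` are all hypothetical stand-ins, assumed here only so the example runs without a FastText model. The sketch shows the idea only: predict the dominant label, use the classifier's own scores to find the words supporting that label, mask them, and classify what remains.

```python
import math

# Hypothetical per-word scores standing in for a real LID's learned features.
TOY_WEIGHTS = {
    "the": {"eng": 2.0, "deu": 0.1},
    "weather": {"eng": 2.0, "deu": 0.1},
    "is": {"eng": 1.5, "deu": 0.2},
    "nice": {"eng": 1.5, "deu": 0.3},
    "today": {"eng": 2.0, "deu": 0.1},
    "aber": {"eng": 0.1, "deu": 2.0},
    "ich": {"eng": 0.1, "deu": 2.0},
    "bleibe": {"eng": 0.1, "deu": 2.0},
}
LANGS = ("eng", "deu")


def word_scores(word):
    """Return the toy score of a word for each language (0.0 if unknown)."""
    return TOY_WEIGHTS.get(word, {lang: 0.0 for lang in LANGS})


def predict(words):
    """Toy sentence-level LID: sum word scores, apply softmax, return top label."""
    totals = {lang: sum(word_scores(w)[lang] for w in words) for lang in LANGS}
    norm = sum(math.exp(v) for v in totals.values())
    probs = {lang: math.exp(v) / norm for lang, v in totals.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


def dominant_language(word):
    """Language with the strictly highest score for this word, else None."""
    scores = word_scores(word)
    best = max(scores, key=scores.get)
    others = [v for lang, v in scores.items() if lang != best]
    return best if all(scores[best] > v for v in others) else None


def mask_lid_sketch(text, max_rounds=3, min_words=2):
    """Iteratively predict a label, mask the words supporting it, and repeat."""
    words = text.lower().split()
    found = []
    for _ in range(max_rounds):
        if len(words) < min_words:
            break  # too little text left to classify (assumed stopping rule)
        label, _ = predict(words)
        if label in found:
            break  # no new language surfaced
        supporting = {w for w in words if dominant_language(w) == label}
        if not supporting:
            break  # nothing to mask, so another round would not change anything
        found.append(label)
        # Masking step: drop the words that the LID itself attributes to `label`.
        words = [w for w in words if w not in supporting]
    return found


print(mask_lid_sketch("The weather is nice today aber ich bleibe"))
```

In this toy setting, the first round returns only the dominant label, which mirrors the limitation the abstract describes; the second label surfaces only after the first language's words are masked. A real setup would replace `predict` and `word_scores` with the chosen LID's own scoring.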
