Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection
July 16, 2024
作者: Gaetan Lopez Latouche, Marc-André Carbonneau, Ben Swanson
cs.AI
Abstract
Grammatical Error Detection (GED) methods rely heavily on human annotated
error corpora. However, these annotations are unavailable in many low-resource
languages. In this paper, we investigate GED in this context. Leveraging the
zero-shot cross-lingual transfer capabilities of multilingual pre-trained
language models, we train a model using data from a diverse set of languages to
generate synthetic errors in other languages. These synthetic error corpora are
then used to train a GED model. Specifically, we propose a two-stage fine-tuning
pipeline where the GED model is first fine-tuned on multilingual synthetic data
from target languages followed by fine-tuning on human-annotated GED corpora
from source languages. This approach outperforms current state-of-the-art
annotation-free GED methods. We also analyse the errors produced by our method
and other strong baselines, finding that our approach produces errors that are
more diverse and more similar to human errors.
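The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all names (`generate_synthetic_errors`, `fine_tune`, the corpus arguments) are hypothetical placeholders standing in for the actual error-generation model and training routines.

```python
# Hypothetical sketch of the pipeline from the abstract.
# Stage 0: a multilingual error-generation model corrupts clean
# target-language text to build a synthetic GED corpus.
# Stage 1: fine-tune the GED model on that synthetic target-language data.
# Stage 2: continue fine-tuning on human-annotated source-language corpora.

def generate_synthetic_errors(clean_sentences, error_model):
    """Pair each clean sentence with a synthetically corrupted version,
    yielding (erroneous, correct) training examples."""
    return [(error_model(s), s) for s in clean_sentences]

def two_stage_finetune(ged_model, synthetic_target_corpus,
                       human_source_corpus, fine_tune):
    """Apply the two fine-tuning stages in order and return the model."""
    ged_model = fine_tune(ged_model, synthetic_target_corpus)  # stage 1
    ged_model = fine_tune(ged_model, human_source_corpus)      # stage 2
    return ged_model
```

In practice `error_model` would be a multilingual pre-trained language model fine-tuned on error corpora from a diverse set of source languages, relying on zero-shot cross-lingual transfer to produce plausible errors in the target languages.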