LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons
February 21, 2024
Authors: Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach
cs.AI
Abstract
Data scarcity in low-resource languages can be addressed with word-to-word
translations from labeled task data in high-resource languages using bilingual
lexicons. However, bilingual lexicons often have limited lexical overlap with
task data, which results in poor translation coverage and lexicon utilization.
We propose lexicon-conditioned data generation (LexC-Gen), a method that
generates low-resource-language classification task data at scale.
Specifically, LexC-Gen first uses high-resource-language words from bilingual
lexicons to generate lexicon-compatible task data, and then it translates them
into low-resource languages with bilingual lexicons via word translation.
Across 17 extremely low-resource languages, LexC-Gen-generated data is
competitive with expert-translated gold data, and yields on average 5.6 and 8.9
points improvement over existing lexicon-based word translation methods on
sentiment analysis and topic classification tasks, respectively. We show that
conditioning on bilingual lexicons is the key component of LexC-Gen. LexC-Gen
is also practical: it needs only a single GPU to generate data at scale. It
works well with open-access LLMs, and its cost is one-fifth of the cost of
GPT4-based multilingual data generation.
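
The word-translation step and the coverage problem described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the toy lexicon and its placeholder `tgt_` tokens are made up, the function names are hypothetical, and "translation coverage" and "lexicon utilization" are given simple working definitions (fraction of sentence tokens found in the lexicon, and fraction of lexicon entries used by the data) that may differ from the paper's. The LLM generation step is not modeled; the second sentence merely stands in for text written using lexicon words.

```python
# Hypothetical toy bilingual lexicon: high-resource word -> placeholder
# target-language token. Real lexicons are far larger.
LEXICON = {
    "movie": "tgt_movie",
    "good": "tgt_good",
    "bad": "tgt_bad",
    "food": "tgt_food",
    "very": "tgt_very",
}


def word_translate(sentence, lexicon):
    """Replace each token that has a lexicon entry; keep other tokens as-is."""
    tokens = sentence.lower().split()
    return " ".join(lexicon.get(tok, tok) for tok in tokens)


def translation_coverage(sentence, lexicon):
    """Assumed definition: fraction of tokens that have a lexicon entry."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)


def lexicon_utilization(sentences, lexicon):
    """Assumed definition: fraction of lexicon entries used by the data."""
    if not lexicon:
        return 0.0
    seen = {tok for s in sentences for tok in s.lower().split()}
    return sum(word in seen for word in lexicon) / len(lexicon)


# Existing task data may share few words with the lexicon.
existing = "the plot was surprisingly dull"
# Text written with lexicon words in mind is fully translatable.
lexicon_compatible = "very good movie"

for text in (existing, lexicon_compatible):
    print(word_translate(text, LEXICON), translation_coverage(text, LEXICON))
print(lexicon_utilization([lexicon_compatible], LEXICON))
```

In this toy setting the first sentence is left untranslated because none of its words appear in the lexicon, while the second is translated token by token. That contrast is the gap that conditioning generation on lexicon words is meant to close.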