LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons

Abstract

Data scarcity in low-resource languages can be addressed by translating labeled task data from high-resource languages word by word using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization. We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then translates that data into low-resource languages with bilingual lexicons via word translation. Across 17 extremely low-resource languages, LexC-Gen-generated data is competitive with expert-translated gold data, and yields average improvements of 5.6 and 8.9 points over existing lexicon-based word translation methods on sentiment analysis and topic classification, respectively. We show that conditioning on bilingual lexicons is the key component of LexC-Gen. LexC-Gen is also practical: it needs only a single GPU to generate data at scale, works well with open-access LLMs, and costs one-fifth as much as GPT-4-based multilingual data generation.
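
To make the word-translation step and the two lexicon metrics concrete, here is a minimal Python sketch. It is not the authors' implementation. The function names, the toy lexicon with placeholder target words (`L_good`, `L_movie`, ...), and the exact definitions of coverage (share of tokens found in the lexicon) and utilization (share of lexicon entries that are used) are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical toy lexicon: high-resource word -> placeholder low-resource word.
TOY_LEXICON: Dict[str, str] = {
    "good": "L_good",
    "movie": "L_movie",
    "bad": "L_bad",
    "food": "L_food",
}


def word_translate(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Translate word by word; tokens missing from the lexicon stay unchanged."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]


def translation_coverage(tokens: List[str], lexicon: Dict[str, str]) -> float:
    """Assumed definition: fraction of tokens that have a lexicon entry."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in lexicon)
    return hits / len(tokens)


def lexicon_utilization(corpus: List[List[str]], lexicon: Dict[str, str]) -> float:
    """Assumed definition: fraction of lexicon entries used at least once."""
    if not lexicon:
        return 0.0
    used = {tok.lower() for sent in corpus for tok in sent if tok.lower() in lexicon}
    return len(used) / len(lexicon)


sentence = ["The", "movie", "was", "good"]
translated = word_translate(sentence, TOY_LEXICON)
coverage = translation_coverage(sentence, TOY_LEXICON)
utilization = lexicon_utilization([sentence], TOY_LEXICON)
print(translated, coverage, utilization)
```

In this toy case only half of the tokens are translated and only half of the lexicon is used. That is the kind of mismatch the abstract attributes to existing task data, and the motivation it gives for generating lexicon-compatible data before translating.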
