

TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation

August 12, 2025
Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden
cs.AI

Abstract

LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help, but the improvements they give are limited by the size, quality, and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good-quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high-quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good-quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.
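To make the pipeline concrete, below is a minimal sketch of the generate-then-backtranslate idea the abstract describes. The `generate` helper, the prompt wording, and the language choices are illustrative assumptions, not the paper's actual implementation; see the linked repository for the authors' code.

```python
# Minimal sketch of the TopXGen idea: (1) have an LLM write topic-diverse
# target-side text in a low-resource language, then (2) backtranslate it
# into a high-resource source language to obtain synthetic parallel pairs.
# `generate` is a hypothetical placeholder for any instruction-tuned LLM call.

def generate(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its completion."""
    raise NotImplementedError("plug in your LLM client here")

def topxgen_sketch(topics, target_lang="Basque", source_lang="English",
                   n_per_topic=4):
    pairs = []
    for topic in topics:
        for _ in range(n_per_topic):
            # Step 1: generation into the LRL, where the LLM's
            # multilinguality yields natural-sounding target-side text.
            target_text = generate(
                f"Write a short paragraph in {target_lang} about: {topic}"
            )
            # Step 2: backtranslation into the HRL, a direction in which
            # LLMs translate reliably well.
            source_text = generate(
                f"Translate this {target_lang} text into {source_lang}:\n"
                f"{target_text}"
            )
            pairs.append({"source": source_text, "target": target_text,
                          "topic": topic})
    # The resulting pairs can serve as ICL demonstrations or as
    # fine-tuning data for source-to-target MT.
    return pairs
```

The key design choice this sketch reflects is that the LLM only ever translates into the HRL; its weaker LRL translation ability is used for free-form generation, not translation, which is what makes the synthetic target side natural-sounding.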