TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
August 12, 2025
Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden
cs.AI
Abstract
LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help, but the improvements they bring are limited by the size, quality, and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most widespread approach being backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good-quality, relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for generating high-quality, topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good-quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.
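
The abstract describes a two-stage pipeline: an LLM first generates topic-diverse texts directly in the low-resource target language, and those texts are then backtranslated into the high-resource source language to form parallel pairs. Below is a minimal sketch of that structure, assuming a hypothetical `complete(prompt)` helper that queries an LLM; the prompts, example language choices, and function names are illustrative assumptions, not the authors' implementation (see the repository above for the actual code).

```python
# Minimal sketch of a TopXGen-style pipeline. Illustrative only: the
# `complete` helper, prompts, and language choices are assumptions.
from typing import Callable


def topxgen_sketch(
    complete: Callable[[str], str],   # hypothetical LLM completion function
    topics: list[str],
    target_lang: str = "Hausa",       # example low-resource target language
    source_lang: str = "English",     # high-resource source language
    sentences_per_topic: int = 2,
) -> list[tuple[str, str]]:
    """Generate topic-diverse target-side text, then backtranslate it."""
    pairs: list[tuple[str, str]] = []
    for topic in topics:
        # Stage 1: ask the LLM to write natural sentences directly in the
        # low-resource language; diversity comes from varying the topic.
        gen_prompt = (
            f"Write {sentences_per_topic} fluent sentences in {target_lang} "
            f"about the topic: {topic}. One sentence per line."
        )
        target_sentences = complete(gen_prompt).strip().splitlines()

        # Stage 2: backtranslate each generated sentence into the
        # high-resource source language, where LLMs translate well.
        for sentence in target_sentences:
            bt_prompt = f"Translate the following into {source_lang}: {sentence}"
            source_sentence = complete(bt_prompt).strip()
            # Each (source, target) pair can then serve as an ICL example
            # or as supervised fine-tuning data for an MT model.
            pairs.append((source_sentence, sentence))
    return pairs
```

In practice the generation and backtranslation stages could use different models and more elaborate prompts; this sketch only mirrors the two-stage structure the abstract describes.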