A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
September 20, 2023
Authors: Haoran Xu, Young Jin Kim, Amr Sharaf, Hany Hassan Awadalla
cs.AI
Abstract
Generative Large Language Models (LLMs) have achieved remarkable advancements
in various NLP tasks. However, these advances have not been reflected in the
translation task, especially for models of moderate size (i.e., 7B or 13B
parameters), which still lag behind conventional supervised encoder-decoder
translation models. Previous studies have attempted to improve the translation
capabilities of these moderate LLMs, but their gains have been limited. In this
study, we propose a novel fine-tuning approach for LLMs that is specifically
designed for the translation task, eliminating the need for the abundant
parallel data that traditional translation models usually depend on. Our
approach consists of two fine-tuning stages: initial fine-tuning on monolingual
data followed by subsequent fine-tuning on a small set of high-quality parallel
data. We introduce the LLM developed through this strategy as Advanced Language
Model-based trAnslator (ALMA). Based on LLaMA-2 as our underlying model, our
results show that the model can achieve an average improvement of more than 12
BLEU and 12 COMET over its zero-shot performance across 10 translation
directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test
datasets. The performance is significantly better than all prior work and even
superior to the NLLB-54B model and GPT-3.5-text-davinci-003, with only 7B or
13B parameters. This method establishes the foundation for a novel training
paradigm in machine translation.
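To make the two-stage recipe concrete, below is a minimal sketch of what such training could look like with the Hugging Face transformers Trainer. The data files (monolingual.txt, parallel.jsonl), prompt template, and hyperparameters are illustrative assumptions, not the authors' exact configuration; only the overall structure (monolingual fine-tuning followed by fine-tuning on a small set of high-quality parallel data, starting from LLaMA-2) follows the abstract.

```python
# Sketch of the two-stage fine-tuning recipe: stage 1 on monolingual text,
# stage 2 on a small high-quality parallel set, both as causal-LM training.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # underlying model named in the abstract

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Plain causal-LM tokenization; labels are created by the collator below.
    return tokenizer(batch["text"], truncation=True, max_length=512)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Stage 1: continued training on monolingual data in the languages of interest
# (file path and hyperparameters are placeholders).
mono = load_dataset("text", data_files={"train": "monolingual.txt"})["train"]
mono = mono.map(tokenize, batched=True, remove_columns=["text"])
stage1_args = TrainingArguments(
    output_dir="alma-stage1",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
)
Trainer(model=model, args=stage1_args, train_dataset=mono,
        data_collator=collator).train()

# Stage 2: fine-tuning on a small set of high-quality parallel data,
# formatted here with an assumed translation-prompt template.
def format_pair(example):
    return {
        "text": (
            f"Translate this from {example['src_lang']} to {example['tgt_lang']}:\n"
            f"{example['src_lang']}: {example['source']}\n"
            f"{example['tgt_lang']}: {example['target']}"
        )
    }

parallel = load_dataset("json", data_files={"train": "parallel.jsonl"})["train"]
parallel = parallel.map(format_pair)
parallel = parallel.map(tokenize, batched=True,
                        remove_columns=parallel.column_names)
stage2_args = TrainingArguments(
    output_dir="alma-stage2",
    per_device_train_batch_size=4,
    num_train_epochs=2,
    learning_rate=1e-5,
)
Trainer(model=model, args=stage2_args, train_dataset=parallel,
        data_collator=collator).train()
```

The key design point the abstract emphasizes is that stage 2 needs only a small amount of high-quality parallel data, rather than the large parallel corpora conventional supervised systems rely on.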