Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

Abstract

Moderate-sized large language models (LLMs), those with 7B or 13B parameters, exhibit promising machine translation (MT) performance. However, even the top-performing 13B LLM-based translation models, like ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning (SFT) for LLMs in the MT task, emphasizing the quality issues present in the reference data even though it is human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating adequate but not perfect translations. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on the WMT'21, WMT'22, and WMT'23 test datasets.
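The abstract does not state the CPO objective itself, so the following is only an illustrative sketch of the contrast it draws between SFT and a preference-based objective. It assumes a sigmoid-based preference term over sequence log-probabilities, combined with a likelihood term on the preferred translation. The function names, the `beta` value, and the example log-probabilities are hypothetical and are not taken from the text.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def sft_loss(logp_reference: float) -> float:
    """SFT: maximize the likelihood of the reference translation only."""
    return -logp_reference


def contrastive_preference_loss(
    logp_preferred: float,
    logp_dispreferred: float,
    beta: float = 0.1,  # assumed scaling factor, not from the source
) -> float:
    """Assumed preference objective: favor the better translation over an
    adequate-but-imperfect one, while still fitting the better one."""
    # Preference term: shrinks as the preferred translation becomes more
    # likely than the dispreferred one.
    prefer_term = -log_sigmoid(beta * (logp_preferred - logp_dispreferred))
    # Likelihood term: keeps the model close to the preferred translation.
    nll_term = -logp_preferred
    return prefer_term + nll_term


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities under the model being trained.
    print(sft_loss(-5.0))
    print(contrastive_preference_loss(-5.0, -6.0))
    print(contrastive_preference_loss(-5.0, -20.0))
```

Under these assumptions, the loss falls as the model assigns less probability to the dispreferred translation relative to the preferred one, whereas the SFT loss ignores the dispreferred translation entirely.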
