Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis
September 30, 2024
Authors: Hippolyte Gisserot-Boukhlef, Ricardo Rei, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo, Nuno M. Guerreiro
cs.AI
Abstract
Neural metrics for machine translation (MT) evaluation have become
increasingly prominent due to their superior correlation with human judgments
compared to traditional lexical metrics. Researchers have therefore utilized
neural metrics through quality-informed decoding strategies, achieving better
results than likelihood-based methods. With the rise of Large Language Models
(LLMs), preference-based alignment techniques have gained attention for their
potential to enhance translation quality by optimizing model weights directly
on preferences induced by quality estimators. This study focuses on Contrastive
Preference Optimization (CPO) and conducts extensive experiments to evaluate
the impact of preference-based alignment on translation quality. Our findings
indicate that while CPO consistently outperforms Supervised Fine-Tuning (SFT)
on high-quality data with regard to the alignment metric, it may lead to
instability across downstream evaluation metrics, particularly between neural
and lexical ones. Additionally, we demonstrate that relying solely on the base
model for generating candidate translations achieves performance comparable to
using multiple external systems, while ensuring better consistency across
downstream metrics.
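To make the alignment objective concrete, the following is a minimal sketch of the CPO loss the abstract refers to, following its standard formulation (a preference term plus an SFT-style negative log-likelihood on the preferred translation). The scalar log-probabilities, the function name, and the `beta` default are illustrative assumptions, not the paper's implementation.

```python
import math

def cpo_loss(logp_chosen: float, logp_rejected: float, beta: float = 0.1) -> float:
    """Toy sketch of the Contrastive Preference Optimization objective.

    logp_chosen / logp_rejected: sequence log-probabilities assigned by the
    policy model to the preferred and dispreferred candidate translations
    (scalars here for illustration; in practice, summed token log-probs).
    """
    # Preference term: -log sigmoid(beta * margin), which decreases as the
    # policy ranks the quality-estimator-preferred translation higher.
    margin = logp_chosen - logp_rejected
    pref = -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
    # Regularization term: an SFT-like NLL on the preferred translation,
    # keeping the policy anchored to high-quality outputs.
    nll = -logp_chosen
    return pref + nll
```

Minimizing this loss jointly pushes the model toward the quality estimator's preferences (the contrastive term) while retaining supervised-style fitting to the chosen translations (the NLL term), which is why its gains show up on the alignment metric even when downstream metrics diverge.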