CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

Abstract

Prior work establishes that controlling the contrastiveness between a large language model's self-generated responses, as measured by reward scores, improves downstream preference tuning in English. We extend this method to the multilingual setting and evaluate two models on a diverse set of tasks across 14 high- and low-resource languages. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base model) produces useful within-language rankings in most languages, and pairing responses in either a monolingual or a multilingual setting improves over the base model in the majority of task configurations while avoiding the catastrophic forgetting seen with supervised fine-tuning. We observe that these gains require on-policy data: off-policy responses reduce the benefit, and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base model in 6 of 7 languages for EuroLLM-9B and in 4 of 7 settings for Aya-3B. On open-ended generation, both tuned models outperform their respective base models across the 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.
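The idea of controlling contrastiveness via reward scores can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Candidate` type, the `build_pair` helper, the `target_gap` parameter, and the selection rule itself (take the top-scored response as chosen, then pick as rejected the response whose reward gap is closest to a target value). The reward values are made up.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str      # a self-generated response to one prompt
    reward: float  # score from a reward model (hypothetical values below)


def build_pair(candidates, target_gap):
    """Return (chosen, rejected) for one prompt.

    Illustrative rule, not the paper's: chosen is the highest-reward
    candidate; rejected is the remaining candidate whose reward gap to
    chosen is closest to target_gap.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: c.reward, reverse=True)
    chosen = ranked[0]
    rejected = min(
        ranked[1:],
        key=lambda c: abs((chosen.reward - c.reward) - target_gap),
    )
    return chosen, rejected


# Four sampled responses for a single prompt, with made-up reward scores.
samples = [
    Candidate("a", 0.9),
    Candidate("b", 0.7),
    Candidate("c", 0.2),
    Candidate("d", 0.4),
]
chosen, rejected = build_pair(samples, target_gap=0.5)
print(chosen.text, rejected.text)
```

In this sketch a larger `target_gap` yields more strongly contrasting pairs, while a smaller one yields pairs that are harder to tell apart. Because only relative scores among responses to the same prompt are compared, the rule relies on within-language rankings rather than on scores being comparable across languages.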