CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

May 25, 2026
Authors: Mike Zhang, Ali Basirat, Desmond Elliott
cs.AI

Abstract

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models on a diverse set of tasks across 14 high- and low-resource languages. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each base model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data: off-policy responses reduce the benefit, and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base model in 6/7 languages for EuroLLM-9B and in 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base models across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.
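The abstract does not spell out how the contrast between self-generated responses is set from reward scores, so the following is only a minimal sketch under stated assumptions. It assumes each prompt already has several on-policy responses in the same language, each with a scalar score from a reward model, and that a pair is formed by taking the top-scored response as "chosen" and, as "rejected", the response whose reward gap to it is closest to a target value. The names `Scored`, `PreferencePair`, `build_pair`, and `target_gap` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Scored:
    text: str
    reward: float  # scalar score from a reward model (assumed to be given)


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    gap: float  # reward(chosen) - reward(rejected)


def build_pair(
    prompt: str,
    candidates: Sequence[Scored],
    target_gap: float,
) -> Optional[PreferencePair]:
    """Form one preference pair from reward-scored self-generations.

    ASSUMPTION: the selection rule below (best response vs. the response
    whose reward gap is closest to `target_gap`) is illustrative only and
    is not taken from the paper.
    """
    if len(candidates) < 2:
        return None  # a pair needs at least two responses

    ranked = sorted(candidates, key=lambda c: c.reward, reverse=True)
    chosen = ranked[0]

    # Pick the rejected response whose gap to the best one is nearest the target.
    rejected = min(
        ranked[1:],
        key=lambda c: abs((chosen.reward - c.reward) - target_gap),
    )

    gap = chosen.reward - rejected.reward
    if gap <= 0:
        return None  # all responses tied: no usable contrast

    return PreferencePair(prompt, chosen.text, rejected.text, gap)


# Usage: three sampled responses to one prompt, with made-up reward scores.
samples = [Scored("a", 0.9), Scored("b", 0.5), Scored("c", 0.1)]
pair = build_pair("q", samples, target_gap=0.5)
```

In this sketch a larger `target_gap` selects more strongly contrasting pairs and a smaller one selects near-ties. The resulting pairs could then feed an offline preference optimization step, which the abstract reports was not outperformed by the online variant.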