CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations
May 25, 2026
Authors: Mike Zhang, Ali Basirat, Desmond Elliott
cs.AI
Abstract
Prior work establishes that controlling the contrast between a large language model's self-generated responses, as set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models on a diverse set of tasks across a total of 14 high- and low-resource languages. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings in most languages, and pairing responses in either a monolingual or a multilingual setting improves over each base model in the majority of setups while avoiding the catastrophic forgetting associated with supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit, and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base model in 6/7 languages for EuroLLM-9B and in 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base models in all 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.
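To make the pairing idea concrete, the following is a minimal Python sketch of how one preference pair might be built from reward-scored self-generations in a single language. The abstract does not specify the selection rule, so everything here is an assumption for illustration: the names `Candidate`, `PreferencePair`, `build_pair`, and `target_margin` are hypothetical, and "pick the top-scored response as chosen, then the response whose reward gap is closest to a target margin as rejected" is only one plausible way to control contrast, not necessarily the authors' procedure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One self-generated response with its reward-model score (hypothetical type)."""
    text: str
    reward: float


@dataclass(frozen=True)
class PreferencePair:
    """A chosen/rejected pair for offline preference tuning (hypothetical type)."""
    prompt: str
    language: str
    chosen: str
    rejected: str
    margin: float


def build_pair(
    prompt: str,
    language: str,
    candidates: Sequence[Candidate],
    target_margin: float,
) -> Optional[PreferencePair]:
    """Build one pair from same-language candidates.

    Assumed rule (not taken from the paper): the highest-reward response is
    'chosen'; 'rejected' is the remaining response whose reward gap to the
    chosen one is closest to `target_margin`.
    """
    if len(candidates) < 2:
        return None  # need at least two responses to form a contrast

    ranked = sorted(candidates, key=lambda c: c.reward, reverse=True)
    chosen = ranked[0]
    rejected = min(
        ranked[1:],
        key=lambda c: abs((chosen.reward - c.reward) - target_margin),
    )
    margin = chosen.reward - rejected.reward
    if margin <= 0:
        return None  # all candidates tie, so there is no usable preference signal

    return PreferencePair(prompt, language, chosen.text, rejected.text, margin)


if __name__ == "__main__":
    # Reward values are made up for the example.
    samples = [
        Candidate("response A", 2.0),
        Candidate("response B", 1.75),
        Candidate("response C", 0.75),
    ]
    pair = build_pair("Translate the greeting.", "da", samples, target_margin=1.0)
    print(pair)
```

In this sketch, pairs are formed only among responses to the same prompt in the same language, which mirrors the within-language rankings mentioned in the abstract. A multilingual setting would then pool such pairs across languages before tuning.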