CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generated Data

Abstract

Prior work establishes that controlling the contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models on a diverse set of tasks across a total of 14 high- and low-resource languages. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings for most languages, and pairing in either a monolingual or a multilingual setting improves over each model in the majority of setups while preventing the catastrophic forgetting associated with supervised fine-tuning. We observe that the gains require on-policy data: off-policy responses reduce the benefit. Online preference optimization, however, fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and in 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective bases in all 11 evaluated languages. Overall, our results point to promising directions for multilingual preference tuning.