Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

Abstract

Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, which makes it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach to zero-shot cross-lingual alignment: a reward model is trained on preference data in one source language and applied directly to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation. In the best case, humans prefer cross-lingually aligned models over unaligned models on more than 70% of evaluation instances. Moreover, we find that a different-language reward model sometimes yields better-aligned models than a same-language reward model. We also identify best practices for the case where no language-specific data are available even for supervised finetuning, another component of alignment.
