Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

Abstract

Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, which makes it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, in which a reward model is trained on preference data in one source language and applied directly to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: in the best case, humans prefer cross-lingually aligned models over unaligned models on more than 70% of evaluation instances. We also find that a different-language reward model sometimes yields better-aligned models than a same-language reward model. Finally, we identify best practices for the case where no language-specific data is available even for supervised finetuning, another component of alignment.
