LiPO: Listwise Preference Optimization through Learning-to-Rank

Abstract

Aligning language models (LMs) with curated human feedback is critical for controlling their behavior in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the form of a ranked list over multiple responses, which amortizes the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, there has been little study of fitting a policy directly to a list of responses. In this work, we formulate LM alignment as a listwise ranking problem and describe the Listwise Preference Optimization (LiPO) framework, in which the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), under which most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. Following this connection, we examine ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO and SLiC by a clear margin on two preference alignment tasks.
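The abstract states that pairwise objectives such as DPO arise as the list-size-two case of a ranking objective, and that LiPO-λ weights each preference pair. The abstract gives no formulas, so the following is only a minimal sketch under stated assumptions: each response is assumed to carry a scalar score (higher means the policy prefers it) and a preference label (higher means better), the base loss is a pairwise logistic loss, and the pair weight is a LambdaLoss-style product of a gain difference and a rank-discount difference. The function names and the weight definition are illustrative choices, not taken from the abstract.

```python
import math

def pairwise_logistic_loss(scores, labels, pair_weight=None):
    """Sum of (optionally weighted) logistic losses over all pairs (i, j)
    where response i is labeled better than response j."""
    total = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                w = 1.0 if pair_weight is None else pair_weight(i, j)
                # log(1 + exp(-(s_i - s_j))): small when the better response scores higher
                total += w * math.log1p(math.exp(-(scores[i] - scores[j])))
    return total

def lambda_style_weights(scores, labels):
    """Assumed LambdaLoss-style pair weight: |gain_i - gain_j| * |1/D(rank_i) - 1/D(rank_j)|,
    with gain = 2**label - 1 and D(rank) = log2(1 + rank). Ranks come from current scores."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    rank = {k: pos + 1 for pos, k in enumerate(order)}  # 1-indexed rank positions

    def weight(i, j):
        gain_diff = abs((2 ** labels[i] - 1) - (2 ** labels[j] - 1))
        discount_diff = abs(1.0 / math.log2(1 + rank[i]) - 1.0 / math.log2(1 + rank[j]))
        return gain_diff * discount_diff

    return weight

# List size two: the unweighted loss is a single logistic term on the score margin.
two_item_loss = pairwise_logistic_loss([1.0, 0.0], [1, 0])

# List size three: the same function, with each pair re-weighted.
scores, labels = [2.0, 1.0, 0.0], [2, 1, 0]
weighted_loss = pairwise_logistic_loss(scores, labels, lambda_style_weights(scores, labels))
```

With two responses and no weighting, the loss reduces to one logistic term on the score difference between the preferred and the dispreferred response, which is the shape of a DPO-style pairwise loss. With longer lists, the weight function gives more influence to pairs whose swap would matter more for the ranking.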
