LiPO: Listwise Preference Optimization through Learning-to-Rank

Abstract

Aligning language models (LMs) with curated human feedback is critical to controlling their behavior in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes as a ranked list over multiple responses, which amortizes the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, there has been little study of fitting a policy directly to a list of responses. In this work, we formulate LM alignment as a listwise ranking problem and describe the Listwise Preference Optimization (LiPO) framework, in which the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. Following this connection, we examine ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO and SLiC by a clear margin on two preference alignment tasks.
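The idea of weighting each preference pair within a ranked list can be illustrated with a small sketch. The abstract does not give the LiPO-λ formula, so everything below is an assumption: the function names are hypothetical, the pair weight is a generic DCG-style lambda weight (gain gap times rank-discount gap) of the kind used in Learning-to-Rank, and `scores` stands in for the policy's per-response scores (in DPO-style methods, a scaled log-probability ratio against a reference policy). With unit weights and a list of size two, the loss reduces to a single pairwise logistic term, which is the DPO-style special case the abstract mentions.

```python
import math
from itertools import combinations


def pairwise_logistic(score_diff):
    """Numerically stable log(1 + exp(-d)) for a score difference d."""
    if score_diff >= 0:
        return math.log1p(math.exp(-score_diff))
    return -score_diff + math.log1p(math.exp(score_diff))


def lambda_weight(label_i, label_j, rank_i, rank_j):
    """Assumed DCG-style pair weight: |gain gap| * |rank-discount gap|."""
    gain_gap = abs((2.0 ** label_i - 1.0) - (2.0 ** label_j - 1.0))
    discount_gap = abs(1.0 / math.log2(1.0 + rank_i) - 1.0 / math.log2(1.0 + rank_j))
    return gain_gap * discount_gap


def listwise_lambda_loss(scores, labels, use_lambda_weights=True):
    """Sum of (optionally weighted) pairwise logistic losses over one ranked list.

    scores: model scores, one per response (higher means preferred by the model).
    labels: preference labels, one per response (higher means better).
    """
    # 1-based rank of each response under the model's current scores.
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    rank = {idx: pos + 1 for pos, idx in enumerate(order)}

    total = 0.0
    for a, b in combinations(range(len(scores)), 2):
        if labels[a] == labels[b]:
            continue  # tied labels carry no preference signal
        # Orient the pair so that `win` has the higher label.
        win, lose = (a, b) if labels[a] > labels[b] else (b, a)
        weight = 1.0
        if use_lambda_weights:
            weight = lambda_weight(labels[win], labels[lose], rank[win], rank[lose])
        total += weight * pairwise_logistic(scores[win] - scores[lose])
    return total
```

Calling `listwise_lambda_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])` scores a list whose model ordering agrees with the labels, and it yields a lower loss than the reversed ordering `[0.0, 1.0, 2.0]`. The weighting makes mistakes between responses that are far apart in label and in rank cost more than mistakes between near neighbors.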

