Discovering Preference Optimization Algorithms with and for Large Language Models
June 12, 2024
Authors: Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, Robert Tjarko Lange
cs.AI
Abstract
Offline preference optimization is a key method for enhancing and controlling
the quality of Large Language Model (LLM) outputs. Typically, preference
optimization is approached as an offline supervised learning task using
manually-crafted convex loss functions. While these methods are based on
theoretical insights, they are inherently constrained by human creativity, so
the large search space of possible loss functions remains underexplored. We
address this by performing LLM-driven objective discovery to automatically
discover new state-of-the-art preference optimization algorithms without
(expert) human intervention. Specifically, we iteratively prompt an LLM to
propose and implement new preference optimization loss functions based on
previously-evaluated performance metrics. This process leads to the discovery
of previously unknown and performant preference optimization algorithms. We call
the best-performing of these Discovered Preference Optimization (DiscoPOP),
a novel algorithm that adaptively blends logistic and exponential losses.
Experiments demonstrate the state-of-the-art performance of DiscoPOP and its
successful transfer to held-out tasks.
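The abstract describes DiscoPOP as a loss that adaptively blends logistic and exponential terms. A minimal sketch of that idea is below; the sigmoid gate, the temperature value `tau`, and the direction of the blend are illustrative assumptions, not the exact discovered form reported in the paper.

```python
import math

def logistic_loss(x: float) -> float:
    # Log-sigmoid (logistic) loss on the preference margin x: -log(sigmoid(x)).
    return math.log(1.0 + math.exp(-x))

def exp_loss(x: float) -> float:
    # Exponential loss on the preference margin x.
    return math.exp(-x)

def blended_preference_loss(x: float, tau: float = 0.05) -> float:
    """Illustrative adaptive blend of logistic and exponential losses.

    x is the (scaled) difference of policy/reference log-ratios between the
    chosen and rejected responses, as in offline preference optimization.
    A sigmoid gate on x/tau interpolates between the two losses; the gating
    scheme and tau are assumptions for illustration only.
    """
    gate = 1.0 / (1.0 + math.exp(-x / tau))
    return gate * logistic_loss(x) + (1.0 - gate) * exp_loss(x)
```

Both component losses decrease as the margin x grows, so the blend still rewards ranking the chosen response above the rejected one; the gate merely shifts how sharply small margins are penalized.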