Discovering Preference Optimization Algorithms with and for Large Language Models

June 12, 2024
Authors: Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, Robert Tjarko Lange
cs.AI

Abstract

Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. Typically, preference optimization is approached as an offline supervised learning task using manually crafted convex loss functions. While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of possible loss functions remains underexplored. We address this by performing LLM-driven objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously evaluated performance metrics. This process leads to the discovery of previously unknown and performant preference optimization algorithms. The best-performing of these we call Discovered Preference Optimization (DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks.
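The abstract describes an iterative loop in which an LLM proposes candidate loss functions that are trained, scored, and fed back into the next prompt. Below is a minimal, self-contained sketch of what such a loop could look like. Everything here is illustrative: `mock_llm_propose` and `train_and_evaluate` are hypothetical stubs standing in for a real LLM client and a real fine-tuning run, and the feedback format is an assumption, not the authors' actual setup.

```python
# A hypothetical sketch of LLM-driven objective discovery: propose a candidate
# loss, evaluate it, and condition the next proposal on the scored history.
import random

def mock_llm_propose(history):
    """Stub for an LLM call that returns candidate loss-function code,
    conditioned on previously evaluated candidates and their scores."""
    return f"# candidate loss v{len(history)}"

def train_and_evaluate(candidate_code):
    """Stub for fine-tuning a model with the candidate loss and scoring it
    on a downstream benchmark; here we just return a random score."""
    return random.random()

def discover(n_generations: int = 20):
    history = []  # (candidate_code, score) pairs fed back into each proposal
    for _ in range(n_generations):
        code = mock_llm_propose(history)
        score = train_and_evaluate(code)
        history.append((code, score))
    # Return the best-performing discovered objective.
    return max(history, key=lambda cs: cs[1])

print(discover())
```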
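The abstract says DiscoPOP adaptively blends logistic and exponential losses. Below is a minimal PyTorch sketch of one such blend, where a sigmoid gate over the scaled policy/reference log-ratio difference interpolates between a DPO-style logistic loss and an exponential loss. The default values of `beta` and `tau` and the direction of the blending are assumptions from our reading of the paper, not the official implementation; check the authors' released code before relying on them.

```python
# A hedged sketch of a DiscoPOP-style blended preference loss.
import torch
import torch.nn.functional as F

def discopop_loss(policy_logratios: torch.Tensor,
                  reference_logratios: torch.Tensor,
                  beta: float = 0.1,
                  tau: float = 0.05) -> torch.Tensor:
    """policy_logratios / reference_logratios hold, per example,
    log p(y_chosen | x) - log p(y_rejected | x) under the policy
    and the frozen reference model, respectively."""
    logits = beta * (policy_logratios - reference_logratios)
    logistic = -F.logsigmoid(logits)     # DPO's logistic loss
    exponential = torch.exp(-logits)     # exponential loss
    gate = torch.sigmoid(logits / tau)   # adaptive blending weight
    return ((1 - gate) * logistic + gate * exponential).mean()

# Usage with dummy per-example log-ratio values:
pol = torch.tensor([1.2, -0.3, 0.7])
ref = torch.tensor([0.5, -0.1, 0.2])
print(discopop_loss(pol, ref))
```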
