Discovering Preference Optimization Algorithms with and for Large Language Models

Abstract

Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. Typically, preference optimization is approached as an offline supervised learning task using manually crafted convex loss functions. While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of possible loss functions remains underexplored. We address this by performing LLM-driven objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously evaluated performance metrics. This process leads to the discovery of previously unknown, high-performing preference optimization algorithms. We call the best-performing of these Discovered Preference Optimization (DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks.
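To make the phrase "adaptively blends logistic and exponential losses" concrete, the following is a minimal sketch of one way such a blend could look for a single preference pair. The abstract does not give the formula, so everything specific here is an assumption made for illustration: the function name `blended_preference_loss`, the sigmoid gate on the log-ratio difference, and the default values of `beta` and `tau`. The paper itself should be consulted for the exact DiscoPOP objective.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def blended_preference_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.05,  # assumed scale on the log-ratio difference
    tau: float = 0.05,   # assumed temperature of the blending gate
) -> float:
    """Illustrative blend of a logistic and an exponential preference loss.

    rho is the difference between the policy's and the reference model's
    chosen-vs-rejected log-probability gaps, as in DPO-style objectives.
    """
    rho = (policy_chosen_logp - policy_rejected_logp) - (
        ref_chosen_logp - ref_rejected_logp
    )
    logistic = -log_sigmoid(beta * rho)   # logistic (DPO-style) loss
    exponential = math.exp(-beta * rho)   # exponential loss
    gate = sigmoid(rho / tau)             # assumed adaptive mixing weight in [0, 1]
    return (1.0 - gate) * logistic + gate * exponential


if __name__ == "__main__":
    # Policy prefers the chosen response slightly more than the reference does.
    print(blended_preference_loss(-12.0, -15.0, -13.0, -14.5))
```

Under these assumptions, the gate moves the loss toward the exponential term when the policy already separates the chosen and rejected responses more strongly than the reference, and toward the logistic term otherwise. With a small `tau` the switch between the two regimes is sharp.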
