Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Abstract

Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) with human intentions. Unlike offline alignment with a fixed dataset, collecting online feedback from humans or AI on model generations typically leads, through an iterative process, to more capable reward models and better-aligned LLMs. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to meet this requirement. To address this issue, we propose a bilevel objective that is optimistically biased towards potentially high-reward responses, so that out-of-distribution regions are actively explored. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate reward model (RM) and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces the indiscriminate favoring of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that, when used to fine-tune the Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as on various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.
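
As a rough illustration of the "reparameterized reward function" and the DPO baseline mentioned above, the sketch below computes the standard DPO implicit reward and pairwise loss from sequence log-probabilities. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `beta` default, and the use of plain floats are all illustrative choices. The abstract does not give the form of SELM's additional optimism term, so it is deliberately left out here.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style reparameterized reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    The names and the beta default are illustrative assumptions.
    """
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair: -log sigmoid(margin).

    logp_w / logp_l are policy log-probabilities of the chosen and rejected
    responses; ref_logp_* are the same quantities under the reference model.
    """
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy prefers the chosen response more than the reference does,
    # so the loss falls below log(2), its value at zero margin.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```

Because the reward is expressed through the policy itself, no separately trained reward model appears anywhere in the loss, which is the property the abstract refers to.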
