Axiomatic Preference Modeling for Longform Question Answering

Abstract

The remarkable abilities of large language models (LLMs) like GPT-4 partially stem from post-training processes such as Reinforcement Learning from Human Feedback (RLHF), in which human preferences are encoded in a reward model. However, these reward models (RMs) often lack direct knowledge of why, or under what principles, the preference annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals to uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small number of axiomatic signals can help small models outperform GPT-4 in preference scoring. We release our model on Hugging Face: https://huggingface.co/corbyrosset/axiomatic_preference_model
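
To make the idea of scoring answers on a shared scale and checking agreement with human labels more concrete, the following is a minimal sketch, not the method confirmed by the abstract. It assumes a pairwise logistic (Bradley-Terry style) objective, which is one common choice for preference models; the abstract does not state which loss is used. The function names `pairwise_logistic_loss` and `agreement_rate` and the example scores are hypothetical.

```python
import math
from typing import List, Tuple


def pairwise_logistic_loss(score_preferred: float, score_rejected: float) -> float:
    """Loss for one training pair: -log(sigmoid(score_preferred - score_rejected)).

    Assumption: a Bradley-Terry style objective. The loss shrinks as the
    preferred answer's score rises above the rejected answer's score.
    """
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def agreement_rate(
    scored_pairs: List[Tuple[float, float]],
    human_prefers_first: List[bool],
) -> float:
    """Fraction of pairs where the model's ordering matches the human label.

    Each pair holds the model's scores for (first answer, second answer).
    A tie in scores is counted as the model preferring the second answer.
    """
    if len(scored_pairs) != len(human_prefers_first):
        raise ValueError("scores and labels must have the same length")
    if not scored_pairs:
        raise ValueError("need at least one pair")
    matches = sum(
        (first > second) == label
        for (first, second), label in zip(scored_pairs, human_prefers_first)
    )
    return matches / len(scored_pairs)


if __name__ == "__main__":
    # Hypothetical scores for three answer pairs; humans preferred the first answer each time.
    pairs = [(2.0, 1.0), (0.5, 1.5), (3.0, 0.1)]
    labels = [True, True, True]
    print("agreement:", agreement_rate(pairs, labels))
    print("loss on first pair:", pairwise_logistic_loss(2.0, 1.0))
```

Because each answer receives a single scalar score, human-written and LLM-generated answers can be compared directly, and the agreement rate is the kind of quantity one would compare across a small preference model and GPT-4.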
