Direct Language Model Alignment from Online AI Feedback

Abstract

Direct alignment from preferences (DAP) methods, such as DPO (Direct Preference Optimization), have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF) that do not require a separate reward model. However, the preference datasets used in DAP methods are usually collected before training and never updated, so the feedback is purely offline. Moreover, responses in these datasets are often sampled from a language model other than the one being aligned, and because the model evolves over training, the alignment phase is inevitably off-policy. In this study, we posit that online feedback is key and improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as the annotator: at each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback. Despite its simplicity, we demonstrate via human evaluation on several tasks that OAIF outperforms both offline DAP and RLHF methods. We further show that the feedback leveraged in OAIF is easily controllable via instruction prompts to the LLM annotator.
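The training loop described in the abstract (sample two responses from the current model, ask an annotator which is preferred, update with a DAP loss) can be illustrated with a minimal toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the "policy" is a softmax over three canned responses rather than a language model, the hypothetical `prefers_first` rule (shorter reply wins) stands in for the LLM annotator, DPO is used as the DAP loss, and the names `RESPONSES`, `dpo_loss`, and `run_oaif` are invented.

```python
import math
import random
from typing import Callable, List

# Hypothetical toy setup: the "policy" is a softmax over a fixed list of
# candidate responses, standing in for a real language model.
RESPONSES: List[str] = [
    "Yes.",
    "Yes, that is correct.",
    "Yes, that is correct, and here is a much longer explanation.",
]


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dpo_loss(margin: float, beta: float = 1.0) -> float:
    """DPO loss for one pair: -log sigmoid(beta * margin), computed stably."""
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def prefers_first(first: str, second: str) -> bool:
    """Stand-in for the LLM annotator; this toy rule prefers the shorter reply."""
    return len(first) < len(second)


def run_oaif(steps: int = 300, lr: float = 0.5, beta: float = 1.0,
             seed: int = 0,
             annotator: Callable[[str, str], bool] = prefers_first) -> List[float]:
    """Run the toy online loop and return the final response probabilities."""
    rng = random.Random(seed)
    ref_logits = [0.0] * len(RESPONSES)  # frozen reference policy
    logits = list(ref_logits)            # policy being aligned
    for _ in range(steps):
        # 1. Sample two responses from the *current* policy (on-policy).
        i, j = rng.choices(range(len(RESPONSES)), weights=softmax(logits), k=2)
        if i == j:
            continue  # identical samples carry no preference signal
        # 2. Ask the annotator which response is preferred (online feedback).
        win, lose = (i, j) if annotator(RESPONSES[i], RESPONSES[j]) else (j, i)
        # 3. One gradient-descent step on dpo_loss(margin, beta) for this pair.
        #    The softmax normalizers cancel in the log-ratio difference, so the
        #    DPO margin reduces to a difference of logits.
        margin = (logits[win] - ref_logits[win]) - (logits[lose] - ref_logits[lose])
        grad_scale = beta / (1.0 + math.exp(beta * margin))  # beta * sigmoid(-beta * margin)
        logits[win] += lr * grad_scale
        logits[lose] -= lr * grad_scale
    return softmax(logits)
```

Calling `run_oaif()` returns the final probabilities over the three toy responses. Swapping in a different `annotator` function mimics, in a very loose sense, steering the feedback by changing the annotator's instructions; in an offline setting the preference pairs would instead be fixed before the loop starts.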
