Aligning Large Language Models through Synthetic Feedback

Abstract

Aligning large language models (LLMs) to human values has become increasingly important because it enables fine-grained steering of LLMs, e.g., making them follow given instructions while keeping them less toxic. However, alignment requires a significant amount of human demonstrations and feedback. Recently, open-source models have attempted to replicate the alignment learning process by distilling data from already aligned LLMs such as InstructGPT or ChatGPT. While this process reduces human effort, constructing these datasets depends heavily on the teacher models. In this work, we propose a novel framework for alignment learning with almost no human labor and no dependency on pre-aligned LLMs. First, we perform reward modeling (RM) with synthetic feedback by contrasting responses from vanilla LLMs of various sizes and with various prompts. Then, we use the RM to simulate high-quality demonstrations for training a supervised policy and to further optimize the model with reinforcement learning. Our resulting model, Aligned Language Model with Synthetic Training dataset (ALMoST), outperforms open-source models, including Alpaca, Dolly, and OpenAssistant, which are trained on the outputs of InstructGPT or on human-annotated instructions. Our 7B-sized model outperforms the 12-13B models in A/B tests using GPT-4 as the judge, with an average win rate of about 75%.
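The synthetic-feedback step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says that responses from vanilla LLMs of various sizes and prompts are contrasted, so the ranking rule used here (a larger model with more in-context demonstrations is assumed to give the better response), the `Generator` type, and all function names are hypothetical. The pairwise logistic loss is a common choice for training reward models on comparisons and is likewise an assumption.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Generator:
    """Hypothetical response source: a vanilla LLM of a given size and prompt."""
    name: str
    params_b: float  # model size in billions of parameters
    num_shots: int   # number of in-context demonstrations in the prompt


def assumed_rank_key(gen):
    # Assumption (not stated in the abstract): larger models prompted with
    # more demonstrations tend to produce better responses.
    return (gen.params_b, gen.num_shots)


def synthetic_comparisons(responses):
    """Turn {generator: response} into (chosen, rejected) pairs without human labels."""
    ordered = sorted(responses, key=assumed_rank_key, reverse=True)
    return [
        (responses[better], responses[worse])
        for better, worse in combinations(ordered, 2)
        if assumed_rank_key(better) != assumed_rank_key(worse)  # skip ties
    ]


def pairwise_loss(score_chosen, score_rejected):
    """Pairwise logistic loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    responses = {
        Generator("small-1shot", 1.0, 1): "response A",
        Generator("large-3shot", 30.0, 3): "response B",
        Generator("mid-3shot", 7.0, 3): "response C",
    }
    for chosen, rejected in synthetic_comparisons(responses):
        print(f"chosen={chosen!r} rejected={rejected!r}")
```

A reward model trained to minimize `pairwise_loss` over such pairs learns to score the assumed-better response higher. Those scores could then be used to select demonstrations for supervised training and as the reward signal for reinforcement learning, as the abstract outlines.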
