RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment

Abstract

We propose Reinforcement Learning from Contrast Distillation (RLCD), a method for aligning language models to follow natural language principles without using human feedback. RLCD trains a preference model using simulated preference pairs that contain both a high-quality and a low-quality example, generated using contrasting positive and negative prompts. The preference model is then used to improve a base unaligned language model via reinforcement learning. Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks--harmlessness, helpfulness, and story outline generation--and on both 7B and 30B model scales for preference data simulation.
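
As an illustration of the pair-simulation step described above, the following is a minimal sketch, not the authors' implementation. It assumes that the output generated under the positive prompt is treated as the preferred example and the output generated under the negative prompt as the dispreferred one. The names `PreferencePair`, `simulate_pair`, and `toy_generate`, as well as the wording of the two hints, are hypothetical and chosen only for this example; `toy_generate` stands in for a real language model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    prompt: str    # the original, unmodified prompt
    chosen: str    # output generated under the positive prompt
    rejected: str  # output generated under the negative prompt


def simulate_pair(
    prompt: str,
    generate: Callable[[str], str],
    positive_hint: str = "(helpful, harmless response)",   # hypothetical wording
    negative_hint: str = "(unhelpful, harmful response)",  # hypothetical wording
) -> PreferencePair:
    """Build one simulated preference pair from two contrasting prompts."""
    chosen = generate(f"{prompt}\n{positive_hint}")
    rejected = generate(f"{prompt}\n{negative_hint}")
    # Assumption: the label follows from which prompt produced each output,
    # so no separate scoring step is applied in this sketch.
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)


def toy_generate(text: str) -> str:
    """Stand-in for a language model call; echoes the last line of its input."""
    return f"<output for: {text.splitlines()[-1]}>"


pair = simulate_pair("How do I reset my password?", toy_generate)
```

Pairs of this shape could then serve as training data for a preference model, which in turn supplies the reward signal for the reinforcement learning stage.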
