Safe RLHF: Safe Reinforcement Learning from Human Feedback

Abstract

As large language models (LLMs) advance, balancing the performance and safety of AI systems has never been more critical. However, the inherent tension between helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, which keeps crowdworkers from being confused by the tension between the two and allows us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task: maximize the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through three rounds of fine-tuning with Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value alignment algorithms. Experimentally, we fine-tuned Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
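The constrained formulation in the abstract (maximize reward subject to a cost limit, solved with a Lagrange multiplier that is adjusted during training) can be illustrated on a toy problem. The sketch below is not the paper's implementation: it replaces the LLM with a distribution over three fixed responses, replaces policy-gradient fine-tuning with a closed-form entropy-regularized softmax policy, and uses made-up reward and cost values. The names `policy_for`, `solve_constrained`, `temperature`, and `cost_limit` are hypothetical and chosen only for illustration.

```python
import math


def policy_for(lam, rewards, costs, temperature):
    """Softmax policy maximizing E[reward - lam * cost] plus an entropy bonus.

    Assumption: the entropy bonus (scaled by `temperature`) stands in for
    the regularization a real fine-tuning run would use.
    """
    scores = [(r - lam * c) / temperature for r, c in zip(rewards, costs)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def expected(values, probs):
    """Expectation of `values` under the distribution `probs`."""
    return sum(v * p for v, p in zip(values, probs))


def solve_constrained(rewards, costs, cost_limit=0.0,
                      temperature=0.5, lr=0.1, steps=2000):
    """Dual ascent on the Lagrange multiplier `lam`.

    `lam` grows while the expected cost exceeds `cost_limit` and shrinks
    (never below zero) once the constraint is satisfied.
    """
    lam = 0.0
    for _ in range(steps):
        probs = policy_for(lam, rewards, costs, temperature)
        violation = expected(costs, probs) - cost_limit
        lam = max(0.0, lam + lr * violation)
    return policy_for(lam, rewards, costs, temperature), lam


if __name__ == "__main__":
    # Hypothetical values: the most helpful response is also the most harmful.
    rewards = [1.0, 2.0, 3.0]
    costs = [-1.0, 0.5, 2.0]  # cost > 0 is treated as harmful in this toy

    unconstrained = policy_for(0.0, rewards, costs, 0.5)
    safe, lam = solve_constrained(rewards, costs)

    print("unconstrained expected cost:", expected(costs, unconstrained))
    print("constrained expected cost:  ", expected(costs, safe))
    print("learned multiplier:         ", lam)
```

In this toy, the multiplier acts as the dynamic weight between the two objectives: it rises while the policy is too harmful on average and settles at the value where the expected cost meets the limit, at the price of some expected reward.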
