Course-Correction: Safety Alignment Using Synthetic Preferences

Abstract

The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform course-correction, i.e., the ability of a model to steer away from generating harmful content autonomously. We first introduce the C^2-Eval benchmark for quantitative assessment and use it to analyze 10 popular LLMs, revealing that current safety-tuned LLMs vary in their course-correction proficiency. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C^2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on two LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance. It also improves the safety of these LLMs, particularly their resistance to jailbreak attacks.
