Course-Correction: Safety Alignment Using Synthetic Preferences

July 23, 2024
Authors: Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu
cs.AI

Abstract

The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform the task of course-correction, i.e., the model can steer away from generating harmful content autonomously. To start with, we introduce the C^2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing the varying proficiency of current safety-tuned LLMs in course-correction. To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C^2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on two LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs' safety, particularly in resisting jailbreak attacks.
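
The abstract does not name the exact preference-learning objective applied to C^2-Syn's pairwise preferences. As a hedged sketch only, the snippet below illustrates one common choice for learning from pairwise preferences, a DPO-style loss (Rafailov et al., 2023), where the preferred ("chosen") response course-corrects in time and the dispreferred ("rejected") one continues generating harmful content. All function and variable names are illustrative, not the paper's implementation.

```python
# Minimal sketch of a pairwise-preference (DPO-style) loss. Assumption: the
# paper's "preference learning" on C^2-Syn resembles DPO; the exact objective
# is not specified in the abstract. All names here are illustrative.
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Bradley-Terry style loss over sequence log-probabilities: push the
    policy to prefer the response that course-corrects in time (chosen)
    over the one that keeps generating harmful content (rejected)."""
    # Implicit rewards: log-ratio of the policy against a frozen reference model
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up sequence log-probabilities for a batch of two pairs
loss = preference_loss(torch.tensor([-12.0, -10.5]),   # policy, chosen
                       torch.tensor([-15.0, -14.0]),   # policy, rejected
                       torch.tensor([-13.0, -11.0]),   # reference, chosen
                       torch.tensor([-13.5, -12.5]))   # reference, rejected
print(f"pairwise preference loss: {loss.item():.4f}")
```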
