Course-Correction: Safety Alignment Using Synthetic Preferences

Abstract

The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform course-correction, i.e., to autonomously steer away from generating harmful content. To start with, we introduce the C^2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C^2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on two LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it improves LLMs' safety, particularly their resistance to jailbreak attacks.

