Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

Abstract

This study addresses a critical gap in safety tuning for Large Language Models (LLMs): a refusal position bias in safety tuning data, which compromises a model's ability to appropriately refuse to generate unsafe content. We introduce Decoupled Refusal Training (DeRTa), a novel approach designed to let LLMs refuse to comply with harmful prompts at any position in the response, significantly enhancing their safety. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models to transition from potential harm to a safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted with the LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. Importantly, our approach successfully defends against recent advanced attack methods (e.g., CodeAttack) that have jailbroken GPT-4 and LLaMA3-70B-Instruct. Our code and data can be found at https://github.com/RobustNLP/DeRTa.
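To make the two components concrete, the following is a minimal sketch of how training examples for each might be constructed. It is an illustration under stated assumptions, not the released implementation: the function names, the `IGNORE` sentinel, the uniform sampling of the prefix length, the use of a single refusal-start token as the RTO target, and the choice to supervise the position right after the prompt are all hypothetical. Here `targets[i]` is the token the model should predict after reading `input_ids[: i + 1]`.

```python
import random
from typing import List, Tuple

# Sentinel meaning "no loss at this position" (a common convention, assumed here).
IGNORE = -100


def build_prefix_example(
    prompt: List[int],
    harmful: List[int],
    safe: List[int],
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Harmful-prefix MLE sketch: prompt + random harmful prefix + safe response.

    Only the safe-response tokens are supervised, so the model learns to
    refuse even after part of a harmful answer is already in its context.
    Assumes a non-empty prompt and a non-empty safe response.
    """
    k = rng.randint(0, len(harmful))  # k = 0 reduces to ordinary safety tuning
    prefix = harmful[:k]
    # The last safe token is never an input, because nothing is predicted after it.
    input_ids = prompt + prefix + safe[:-1]
    targets = [IGNORE] * (len(prompt) + len(prefix) - 1) + list(safe)
    return input_ids, targets


def build_rto_example(
    prompt: List[int],
    harmful: List[int],
    refusal_start: int,
) -> Tuple[List[int], List[int]]:
    """Transition sketch: at every harmful position, the target is the refusal start.

    The position right after the prompt is also supervised here, which is a
    choice made for this sketch.
    """
    input_ids = prompt + harmful
    targets = [IGNORE] * (len(prompt) - 1) + [refusal_start] * (len(harmful) + 1)
    return input_ids, targets


if __name__ == "__main__":
    rng = random.Random(0)
    prompt, harmful, safe = [1, 2, 3], [10, 11, 12, 13], [20, 21, 22]
    print(build_prefix_example(prompt, harmful, safe, rng))
    print(build_rto_example(prompt, harmful, refusal_start=99))
```

In this sketch the first builder supervises a refusal at one sampled position per example, while the second supervises the transition to refusal at every position of the harmful response. Both target lists can be fed to a cross-entropy loss that skips positions marked `IGNORE`.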
