MART: Improving LLM Safety with Multi-round Automatic Red-Teaming

Abstract

Red-teaming is a common practice for mitigating unsafe behaviors in Large Language Models (LLMs): LLMs are thoroughly assessed to identify potential flaws, which are then addressed with responsible and accurate responses. While effective, manual red-teaming is costly, and existing automatic red-teaming typically discovers safety risks without addressing them. In this paper, we propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation, significantly increasing red-teaming scalability and the safety of the target LLM. Specifically, an adversarial LLM and a target LLM interact iteratively: the adversarial LLM aims to generate challenging prompts that elicit unsafe responses from the target LLM, while the target LLM is fine-tuned with safety-aligned data on these adversarial prompts. In each round, the adversarial LLM crafts better attacks on the updated target LLM, while the target LLM also improves itself through safety fine-tuning. On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment is reduced by up to 84.7% after 4 rounds of MART, achieving performance comparable to LLMs with extensive adversarial prompt writing. Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating that the target LLM maintains strong instruction-following performance.
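The iterative interplay described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: `ToyTarget`, `ToyAdversary`, `is_violation`, `run_mart`, and the `budget` parameter are hypothetical names introduced here, "fine-tuning" is replaced by simple memorization of attack styles, and the abstract does not say how the adversarial LLM is updated or how violations are judged.

```python
from dataclasses import dataclass, field

SAFE_REPLY = "I can't help with that."


def style_of(prompt: str) -> str:
    """Return the attack style encoded as the prefix before the first colon."""
    return prompt.split(":", 1)[0]


@dataclass
class ToyTarget:
    """Hypothetical stand-in for the target LLM."""
    refused_styles: set = field(default_factory=set)

    def respond(self, prompt: str) -> str:
        if style_of(prompt) in self.refused_styles:
            return SAFE_REPLY
        return f"COMPLY: {prompt}"

    def safety_finetune(self, pairs: list) -> None:
        # Toy "fine-tuning": memorize the style of each training prompt.
        for prompt, _safe_response in pairs:
            self.refused_styles.add(style_of(prompt))


@dataclass
class ToyAdversary:
    """Hypothetical stand-in for the adversarial LLM."""
    styles: list

    def generate(self) -> list:
        return [f"{s}:harmful request" for s in self.styles]

    def update(self, successful: list) -> None:
        # Assumption: the adversary concentrates on styles that just worked.
        if successful:
            self.styles = [style_of(p) for p in successful]


def is_violation(response: str) -> bool:
    """Toy judge: every prompt here is harmful, so any non-refusal violates."""
    return response != SAFE_REPLY


def violation_rate(target: ToyTarget, benchmark: list) -> float:
    hits = sum(is_violation(target.respond(p)) for p in benchmark)
    return hits / len(benchmark)


def run_mart(adversary, target, benchmark, rounds=4, budget=1):
    """Run the toy loop; return the benchmark violation rate after each round."""
    rates = []
    for _ in range(rounds):
        prompts = adversary.generate()
        successful = [p for p in prompts if is_violation(target.respond(p))]
        # Pair a limited number of successful attacks with a safe response.
        target.safety_finetune([(p, SAFE_REPLY) for p in successful[:budget]])
        adversary.update(successful)
        rates.append(violation_rate(target, benchmark))
    return rates


if __name__ == "__main__":
    styles = ["roleplay", "hypothetical", "translation", "encoding", "persona", "story"]
    benchmark = [f"{s}:held-out harmful request" for s in styles]
    print(run_mart(ToyAdversary(list(styles)), ToyTarget(), benchmark))
```

In this toy setting the violation rate on the fixed benchmark falls round by round because the target refuses one more attack style after each update. The numbers it prints are artifacts of the toy and say nothing about the results reported in the abstract.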

