制御可能な安全整列：多様な安全要件への推論時適応

要旨

現在の大規模言語モデル（LLM）の安全アラインメントのパラダイムは、一括適用のアプローチに従っています：モデルは、モデル提供者によって安全でないと見なされたコンテンツとのやり取りを拒否します。このアプローチは、異なる文化や地域での社会的規範の違いに対応する柔軟性に欠けています。さらに、ユーザーは多様な安全性ニーズを持っており、静的な安全基準を持つモデルは使用に制限があり、再アラインメントするにはコストがかかりすぎるため、有用ではありません。私たちは、再トレーニングを必要とせずにモデルを多様な安全要件に適応させるためのフレームワークであるControllable Safety Alignment（CoSA）を提案しています。固定されたモデルをアラインメントする代わりに、システムプロンプトの一部として提供される、望ましい安全性行動の自由形式の自然言語記述である安全設定に従うようにモデルをアラインメントします。モデルの安全性行動を調整するために、認証されたユーザーは推論時にそのような安全設定を変更するだけで済みます。そのために、様々な安全設定に簡単に適応するためのLLMをアラインメントするためのデータ中心の手法であるCoSAlignを提案します。さらに、助けになることと構成された安全性の両方を考慮した新しいコントロール可能性評価プロトコルを考案し、それらをCoSA-Scoreにまとめ、多様な安全要件と対応する評価プロンプトを持つ実世界のLLMユースケースから成る人間が作成したベンチマークであるCoSApienを構築します。 CoSAlignは、インコンテキストアラインメントを含む強力なベースラインに比べて、コントロール可能性の大幅な向上をもたらすことを示しています。私たちのフレームワークは、LLMにおける多元的な人間の価値観のより良い表現と適応を促進し、それにより実用性を高めます。

English

The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs -- free-form natural language descriptions of the desired safety behaviors -- that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at inference time. To enable that, we propose CoSAlign, a data-centric method for aligning LLMs to easily adapt to diverse safety configs. Furthermore, we devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and construct CoSApien, a human-authored benchmark that consists of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts. We show that CoSAlign leads to substantial gains of controllability over strong baselines including in-context alignment. Our framework encourages better representation and adaptation to pluralistic human values in LLMs, and thereby increasing their practicality.

制御可能な安全整列：多様な安全要件への推論時適応

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

要旨

Support