Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
October 11, 2024
Authors: Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme
cs.AI
Abstract
The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to re-align.
We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs (free-form natural language descriptions of the desired safety behaviors) provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at inference time. To enable this, we propose CoSAlign, a data-centric method for aligning LLMs to easily adapt to diverse safety configs. Furthermore, we devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and we construct CoSApien, a human-authored benchmark consisting of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts.
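To make the idea of a safety config concrete, the following minimal Python sketch shows how one might assemble an inference-time request whose system prompt embeds a config. The config text, helper name, and message format are illustrative assumptions, not the authors' released code:

# Hypothetical safety config: free-form natural language describing the
# allowed and disallowed behaviors for a particular deployment.
SAFETY_CONFIG = (
    "Fictional depictions of violence are acceptable for mature video-game "
    "scripts, but refuse any request involving real-world harm, hate speech, "
    "or sexually explicit content."
)

def build_messages(safety_config: str, user_prompt: str) -> list[dict]:
    # Embed the safety config in the system prompt; adapting the model's
    # safety behavior at inference time only requires swapping this string.
    return [
        {"role": "system", "content": f"Follow this safety policy:\n{safety_config}"},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages(SAFETY_CONFIG, "Write a battle scene with graphic combat.")

Because the config is ordinary text in the system prompt, an authorized user can switch from one safety requirement to another by editing that string alone, with no re-training of the model.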
We show that CoSAlign leads to substantial gains in controllability over strong baselines, including in-context alignment. Our framework encourages better representation of and adaptation to pluralistic human values in LLMs, thereby increasing their practicality.
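As a rough illustration of how helpfulness and configured safety could be summarized into a single number, the sketch below scores each judged response and averages the results. The aggregation shown is an assumption for illustration, not the published CoSA-Score formula:

from dataclasses import dataclass

@dataclass
class Judgment:
    helpful: bool      # does the response address the user's request?
    config_safe: bool  # does the response respect the given safety config?

def cosa_score_sketch(judgments: list[Judgment]) -> float:
    # Assumed scheme: +1 for a helpful and config-safe response, 0 for a safe
    # but unhelpful response (e.g., an unnecessary refusal), -1 for any
    # response that violates the safety config.
    def score(j: Judgment) -> int:
        if not j.config_safe:
            return -1
        return 1 if j.helpful else 0
    return sum(score(j) for j in judgments) / len(judgments)

# Example: 3 helpful-and-safe responses, 1 safe refusal, 1 config violation
# -> (3 * 1 + 0 - 1) / 5 = 0.4
print(cosa_score_sketch(
    [Judgment(True, True)] * 3 + [Judgment(False, True), Judgment(True, False)]
))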