The Art of Saying No: Contextual Noncompliance in Language Models

Abstract

Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work focuses on refusing "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance that describes when and how models should not comply with user requests. Our taxonomy spans a wide range of categories, including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test the noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories, with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, parameter-efficient methods such as low-rank adapters help strike a good balance between appropriate noncompliance and other capabilities.
