The Art of Saying No: Contextual Noncompliance in Language Models

Abstract

Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work focuses on refusing "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance that describes when and how models should not comply with user requests. The taxonomy spans a wide range of categories, including incomplete, unsupported, indeterminate, and humanizing requests, in addition to unsafe requests. To test the noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories, with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, parameter-efficient methods such as low-rank adapters help strike a good balance between appropriate noncompliance and other capabilities.
